Hosts in my Proxmox cluster are randomly rebooting.

berkaybulut

New Member
Feb 8, 2023
Hello,
I have a 4-node Proxmox cluster.

I use Ceph as storage.

Some of the hosts in my cluster reboot at random: sometimes only one node, sometimes all four. I couldn't find much in the syslog. I need help.
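(In case it helps others reproduce this: logs from the boot before a crash can be pulled roughly like the following, assuming the journal is persistent; adjust the units and the time window as needed.)

# journal of the previous boot, limited to the cluster/HA related units
journalctl -b -1 -u corosync -u pve-cluster -u watchdog-mux -u pve-ha-lrm
# or just the minutes leading up to the reboot
journalctl --since "2025-02-02 22:40" --until "2025-02-02 22:52"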

All hosts run the same versions (pveversion -v):

proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.3 (running version: 8.3.3/f157a38b211595d6)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-7-pve-signed: 6.8.12-7
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-3-pve-signed: 6.8.12-3
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.8-3-pve-signed: 6.8.8-3
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph: 18.2.4-pve3
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.3.3
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1
 
Here is the syslog from around one of the reboots:
Feb 02 22:12:25 cmt6770 pmxcfs[1933]: [status] notice: received log
Feb 02 22:12:27 cmt6770 pmxcfs[1933]: [status] notice: received log
Feb 02 22:17:01 cmt6770 CRON[2095014]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 02 22:17:01 cmt6770 CRON[2095015]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 02 22:17:01 cmt6770 CRON[2095014]: pam_unix(cron:session): session closed for user root
Feb 02 22:17:47 cmt6770 pvestatd[3616116]: status update time (6.593 seconds)
Feb 02 22:19:26 cmt6770 pveproxy[1955192]: Clearing outdated entries from certificate cache
Feb 02 22:19:30 cmt6770 pmxcfs[1933]: [status] notice: received log
Feb 02 22:19:47 cmt6770 pvestatd[3616116]: status update time (5.283 seconds)
Feb 02 22:23:01 cmt6770 pvedaemon[1023275]: writing cluster log failed: ipcc_send_rec[7] failed: Invalid argument
Feb 02 22:30:06 cmt6770 pvestatd[3616116]: status update time (5.456 seconds)
Feb 02 22:31:54 cmt6770 pvedaemon[1007837]: <root@pam> successful auth for user 'root@pam'
Feb 02 22:31:59 cmt6770 pvedaemon[1023275]: writing cluster log failed: ipcc_send_rec[7] failed: Invalid argument
Feb 02 22:32:37 cmt6770 pvedaemon[1023275]: writing cluster log failed: ipcc_send_rec[7] failed: Invalid argument
Feb 02 22:33:57 cmt6770 pvestatd[3616116]: status update time (5.391 seconds)
Feb 02 22:36:16 cmt6770 pvestatd[3616116]: status update time (5.102 seconds)
Feb 02 22:40:07 cmt6770 pvestatd[3616116]: status update time (5.491 seconds)
Feb 02 22:40:24 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: b2c74
Feb 02 22:41:06 cmt6770 pvestatd[3616116]: status update time (5.006 seconds)
Feb 02 22:42:20 cmt6770 pmxcfs[1933]: [status] notice: received log
Feb 02 22:42:21 cmt6770 pmxcfs[1933]: [status] notice: received log
Feb 02 22:43:06 cmt6770 pvestatd[3616116]: status update time (5.185 seconds)
Feb 02 22:44:19 cmt6770 pvedaemon[1023275]: writing cluster log failed: ipcc_send_rec[7] failed: Invalid argument
Feb 02 22:44:19 cmt6770 pvedaemon[956283]: <root@pam> successful auth for user 'root@pam'
Feb 02 22:45:09 cmt6770 pveproxy[2048403]: proxy detected vanished client connection
Feb 02 22:46:26 cmt6770 pvestatd[3616116]: status update time (5.007 seconds)
Feb 02 22:46:33 cmt6770 pvedaemon[1023275]: writing cluster log failed: ipcc_send_rec[7] failed: Invalid argument
Feb 02 22:47:56 cmt6770 pmxcfs[1933]: [dcdb] notice: data verification successful
Feb 02 22:48:25 cmt6770 pvestatd[3616116]: got timeout
Feb 02 22:48:27 cmt6770 pvestatd[3616116]: status update time (6.278 seconds)
Feb 02 22:48:35 cmt6770 pvestatd[3616116]: got timeout
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] link: host: 4 link: 0 is down
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] link: host: 4 link: 1 is down
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] host: host: 4 has no active links
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] host: host: 4 has no active links
Feb 02 22:48:40 cmt6770 corosync[1997]: [TOTEM ] Token has not been received in 7725 ms
Feb 02 22:48:40 cmt6770 pvestatd[3616116]: got timeout
Feb 02 22:48:42 cmt6770 corosync[1997]: [TOTEM ] A processor failed, forming new configuration: token timed out (10300ms), waiting 12360ms for consensus.
Feb 02 22:48:42 cmt6770 pvestatd[3616116]: status update time (11.307 seconds)
Feb 02 22:48:46 cmt6770 kernel: libceph: mon2 (1)10.0.10.9:6789 session established
Feb 02 22:48:48 cmt6770 corosync[1997]: [KNET ] link: host: 3 link: 0 is down
Feb 02 22:48:48 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Feb 02 22:48:54 cmt6770 corosync[1997]: [KNET ] rx: host: 3 link: 0 is up
Feb 02 22:48:54 cmt6770 corosync[1997]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Feb 02 22:48:54 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 02 22:48:54 cmt6770 corosync[1997]: [KNET ] pmtud: Global data MTU changed to: 1397
Feb 02 22:48:58 cmt6770 corosync[1997]: [QUORUM] Sync members[3]: 1 2 3
Feb 02 22:48:58 cmt6770 corosync[1997]: [QUORUM] Sync left[1]: 4
Feb 02 22:48:58 cmt6770 corosync[1997]: [TOTEM ] A new membership (1.7fbe) was formed. Members left: 4
Feb 02 22:48:58 cmt6770 corosync[1997]: [TOTEM ] Failed to receive the leave message. failed: 4
Feb 02 22:49:02 cmt6770 corosync[1997]: [KNET ] link: host: 3 link: 0 is down
Feb 02 22:49:02 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Feb 02 22:49:04 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 10
Feb 02 22:49:05 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 20
Feb 02 22:49:06 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 3
Feb 02 22:49:06 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 4 5
Feb 02 22:49:06 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 30
Feb 02 22:49:06 cmt6770 pmxcfs[1933]: [dcdb] notice: members: 1/1933, 2/2572, 3/3196
Feb 02 22:49:06 cmt6770 pmxcfs[1933]: [dcdb] notice: starting data syncronisation
Feb 02 22:49:07 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 40
Feb 02 22:49:07 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 10
Feb 02 22:49:08 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 50
Feb 02 22:49:08 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 20
Feb 02 22:49:09 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 60
Feb 02 22:49:09 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 30
Feb 02 22:49:10 cmt6770 corosync[1997]: [KNET ] rx: host: 3 link: 0 is up
Feb 02 22:49:10 cmt6770 corosync[1997]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Feb 02 22:49:10 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 02 22:49:10 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 70
Feb 02 22:49:10 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 40
Feb 02 22:49:10 cmt6770 corosync[1997]: [KNET ] pmtud: Global data MTU changed to: 1397
Feb 02 22:49:11 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 80
Feb 02 22:49:11 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retry 50
Feb 02 22:49:11 cmt6770 corosync[1997]: [QUORUM] Members[3]: 1 2 3
Feb 02 22:49:11 cmt6770 corosync[1997]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 02 22:49:11 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retried 83 times
Feb 02 22:49:11 cmt6770 pmxcfs[1933]: [dcdb] notice: cpg_send_message retried 53 times
Feb 02 22:49:11 cmt6770 pmxcfs[1933]: [status] notice: members: 1/1933, 2/2572, 3/3196
Feb 02 22:49:11 cmt6770 pmxcfs[1933]: [status] notice: starting data syncronisation
Feb 02 22:49:23 cmt6770 corosync[1997]: [KNET ] link: host: 3 link: 0 is down
Feb 02 22:49:23 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Feb 02 22:49:24 cmt6770 ceph-osd[2792]: 2025-02-02T22:49:24.091+0300 70eb8b6006c0 -1 osd.2 41888 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.18334602.0:649374 7.77 7:ee414baa:::rbd_data.83d8ca5619e018.0000000000001300:head [write 360448~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41882)
Feb 02 22:49:24 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: e 10 11 12 13 14 15 17
Feb 02 22:49:24 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 10 11 12 13 14 15 17 19 1a
Feb 02 22:49:24 cmt6770 watchdog-mux[1630]: client watchdog expired - disable watchdog updates
ondisk+write+known_if_redirected+supports_pool_eio e41884)
Feb 02 22:49:28 cmt6770 corosync[1997]: [KNET ] link: host: 3 link: 1 is down
Feb 02 22:49:28 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Feb 02 22:49:28 cmt6770 corosync[1997]: [KNET ] host: host: 3 has no active links
Feb 02 22:49:29 cmt6770 corosync[1997]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
Feb 02 22:49:29 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Feb 02 22:49:29 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 11 12 13 17 19 1a 1b 1c 1d 20 21 22 23 24 25 26 27
Feb 02 22:49:29 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 11 12 13 17 19 1b 1d 20 22 23 24 25 26 27
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [dcdb] notice: received all states
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [dcdb] notice: leader is 1/1933
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [dcdb] notice: synced members: 1/1933, 2/2572, 3/3196
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [dcdb] notice: start sending inode updates
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [dcdb] notice: sent all (0) updates
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [dcdb] notice: all data is up to date
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [dcdb] notice: dfsm_deliver_queue: queue length 11
Feb 02 22:49:29 cmt6770 watchdog-mux[1630]: exit watchdog-mux with active connections
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [status] notice: received all states
Feb 02 22:49:29 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 19 1b 1d 20 22 23 26
Feb 02 22:49:29 cmt6770 pvescheduler[2122414]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [status] notice: all data is up to date
Feb 02 22:49:29 cmt6770 pmxcfs[1933]: [status] notice: dfsm_deliver_queue: queue length 1889
Feb 02 22:49:29 cmt6770 systemd-journald[1033]: Received client request to sync journal.
Feb 02 22:49:29 cmt6770 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 02 22:49:29 cmt6770 systemd[1]: watchdog-mux.service: Deactivated successfully.
Feb 02 22:49:29 cmt6770 systemd[1]: watchdog-mux.service: Consumed 41.545s CPU time.
Feb 02 22:49:29 cmt6770 pvescheduler[2122415]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Feb 02 22:49:30 cmt6770 ceph-osd[2792]: 2025-02-02T22:49:30.117+0300 70eb8b6006c0 -1 osd.2 41888 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.18268111.0:214731 7.e 7:705055a7:::rbd_data.e3ee491242a87e.0000000000000266:head [write 1777664~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41882)
Feb 02 22:49:31 cmt6770 ceph-osd[2792]: 2025-02-02T22:49:31.086+0300 70eb8b6006c0 -1 osd.2 41888 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.18268111.0:214731 7.e 7:705055a7:::rbd_data.e3ee491242a87e.0000000000000266:head [write 1777664~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41882)
Feb 02 22:49:31 cmt6770 ceph-osd[2791]: 2025-02-02T22:49:31.669+0300 7a3eb02006c0 -1 osd.18 41888 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.17449834.0:1181107 7.0 7:00003b05:::rbd_data.0a16eb7300114b.00000000000003e0:head [write 2023424~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41887)
Feb 02 22:49:32 cmt6770 ceph-osd[2792]: 2025-02-02T22:49:32.080+0300 70eb8b6006c0 -1 osd.2 41888 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.18268111.0:214731 7.e 7:705055a7:::rbd_data.e3ee491242a87e.0000000000000266:head [write 1777664~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41882)
Feb 02 22:49:32 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 1d 20 22 23 26
Feb 02 22:49:32 cmt6770 pve-ha-lrm[3789583]: loop take too long (61 seconds)
Feb 02 22:49:32 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 1d 20 22 23 26 32 33
Feb 02 22:49:32 cmt6770 pve-ha-lrm[3789583]: watchdog update failed - Broken pipe
Feb 02 22:49:32 cmt6770 ceph-osd[2791]: 2025-02-02T22:49:32.692+0300 7a3eb02006c0 -1 osd.18 41888 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.17449834.0:1181107 7.0 7:00003b05:::rbd_data.0a16eb7300114b.00000000000003e0:head [write 2023424~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41887)
Feb 02 22:49:33 cmt6770 ceph-osd[2791]: 2025-02-02T22:49:33.666+0300 7a3eb02006c0 -1 osd.18 41888 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.17449834.0:1181107 7.0 7:00003b05:::rbd_data.0a16eb7300114b.00000000000003e0:head [write 2023424~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41887)
Feb 02 22:49:33 cmt6770 corosync[1997]: [KNET ] rx: host: 3 link: 0 is up
Feb 02 22:49:33 cmt6770 corosync[1997]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Feb 02 22:49:33 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 02 22:49:34 cmt6770 ceph-osd[2792]: 2025-02-02T22:49:34.120+0300 70eb8b6006c0 -1 osd.2 41888 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.18039030.0:938471 7.16f 7:f6cb3700:::rbd_data.ed023169b04d88.00000000000014f3:head [write 3670016~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41882)
Feb 02 22:49:34 cmt6770 ceph-osd[2791]: 2025-02-02T22:49:34.684+0300 7a3eb02006c0 -1 osd.18 41888 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.17449834.0:1181107 7.0 7:00003b05:::rbd_data.0a16eb7300114b.00000000000003e0:head [write 2023424~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41887)
Feb 02 22:49:35 cmt6770 ceph-osd[2792]: 2025-02-02T22:49:35.087+0300 70eb8b6006c0 -1 osd.2 41888 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.18039030.0:938471 7.16f 7:f6cb3700:::rbd_data.ed023169b04d88.00000000000014f3:head [write 3670016~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41882)
Feb 02 22:49:35 cmt6770 corosync[1997]: [TOTEM ] Retransmit List: 22 23 26 32 33
Feb 02 22:49:35 cmt6770 ceph-osd[2791]: 2025-02-02T22:49:35.729+0300 7a3eb02006c0 -1 osd.18 41888 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.17449834.0:1181107 7.0 7:00003b05:::rbd_data.0a16eb7300114b.00000000000003e0:head [write 2023424~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41887)
Feb 02 22:49:36 cmt6770 ceph-osd[2792]: 2025-02-02T22:49:36.055+0300 70eb8b6006c0 -1 osd.2 41888 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.18039030.0:938471 7.16f 7:f6cb3700:::rbd_data.ed023169b04d88.00000000000014f3:head [write 3670016~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e41882)
-- Reboot --
Feb 02 22:51:24 cmt6770 kernel: Linux version 6.8.12-8-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-8 (2025-01-24T12:32Z) ()
Feb 02 22:51:24 cmt6770 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-8-pve root=/dev/mapper/pve-root ro quiet
 
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] link: host: 4 link: 0 is down
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] link: host: 4 link: 1 is down
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] host: host: 4 has no active links
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 02 22:48:35 cmt6770 corosync[1997]: [KNET ] host: host: 4 has no active links
Feb 02 22:49:28 cmt6770 corosync[1997]: [KNET ] link: host: 3 link: 1 is down
Feb 02 22:49:28 cmt6770 corosync[1997]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Feb 02 22:49:28 cmt6770 corosync[1997]: [KNET ] host: host: 3 has no active links
Feb 02 22:49:32 cmt6770 pve-ha-lrm[3789583]: loop take too long (61 seconds)
Maybe a network, cable, or switch problem: the node loses its corosync links, the HA watchdog is no longer updated (see "client watchdog expired" above), and the node fences (= reboots) itself in the hope that re-initialising everything resolves the problem. That is expected behavior while HA is active.
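A quick way to check whether the corosync links keep flapping and whether fencing is actually armed is the standard tooling, roughly like this (link numbers and node names will of course differ per cluster):

# per-link status of the knet links corosync is using
corosync-cfgtool -s
# quorum and membership as Proxmox sees it
pvecm status
# fencing via the watchdog only happens while HA resources are active
ha-manager status
# Ceph health, since the log also shows slow OSD ops
ceph -s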