Seemingly random system hangs due to corosync/pmxcfs??

These are some of the last logs before the system hangs and becomes completely unresponsive. I have to do a hard power cycle to get it back online.

From what I gather, corosync reports that the link to host 2 is going down repeatedly. What else should I be looking at to continue troubleshooting?

Code:
Aug 26 03:50:02 pm corosync[7436]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 26 03:50:06 pm corosync[7436]:   [QUORUM] Sync members[2]: 1 3
Aug 26 03:50:06 pm corosync[7436]:   [TOTEM ] A new membership (1.73b6) was formed. Members
Aug 26 03:50:07 pm corosync[7436]:   [KNET  ] link: host: 2 link: 0 is down
Aug 26 03:50:07 pm corosync[7436]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 26 03:50:07 pm corosync[7436]:   [KNET  ] host: host: 2 has no active links
Aug 26 03:50:09 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 10
Aug 26 03:50:10 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 20
Aug 26 03:50:10 pm corosync[7436]:   [TOTEM ] Token has not been received in 4527 ms
Aug 26 03:50:11 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 30
Aug 26 03:50:11 pm corosync[7436]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 26 03:50:11 pm corosync[7436]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Aug 26 03:50:11 pm corosync[7436]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 26 03:50:11 pm corosync[7436]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 26 03:50:11 pm pvestatd[7566]: proxmox-backup-client failed: Error: error trying to connect: error connecting to https://pbs001.tuxis.nl:8007/ - dns error: failed to lookup address information: Temporary failure in name resolution
Aug 26 03:50:12 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 40
Aug 26 03:50:13 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 50
Aug 26 03:50:13 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 10
Aug 26 03:50:14 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 60
Aug 26 03:50:14 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 20
Aug 26 03:50:15 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 70
Aug 26 03:50:15 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 30
Aug 26 03:50:16 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 80
Aug 26 03:50:16 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 40
Aug 26 03:50:16 pm corosync[7436]:   [KNET  ] link: host: 2 link: 0 is down
Aug 26 03:50:16 pm corosync[7436]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 26 03:50:16 pm corosync[7436]:   [KNET  ] host: host: 2 has no active links
Aug 26 03:50:17 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 90
Aug 26 03:50:17 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 50
Aug 26 03:50:17 pm corosync[7436]:   [TOTEM ] Token has not been received in 2737 ms
Aug 26 03:50:18 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 100
Aug 26 03:50:18 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retried 100 times
Aug 26 03:50:18 pm pmxcfs[7284]: [dcdb] crit: cpg_send_message failed: 6
Aug 26 03:50:18 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 60
Aug 26 03:50:19 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 10
Aug 26 03:50:19 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 70
Aug 26 03:50:19 pm corosync[7436]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 26 03:50:19 pm corosync[7436]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Aug 26 03:50:19 pm corosync[7436]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 26 03:50:19 pm corosync[7436]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 26 03:50:20 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 20
Aug 26 03:50:20 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 80
Aug 26 03:50:21 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 30
Aug 26 03:50:21 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 90
Aug 26 03:50:22 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 40
Aug 26 03:50:22 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 100
Aug 26 03:50:22 pm pmxcfs[7284]: [status] notice: cpg_send_message retried 100 times
Aug 26 03:50:22 pm pmxcfs[7284]: [status] crit: cpg_send_message failed: 6
Aug 26 03:50:22 pm corosync[7436]:   [TOTEM ] Token has not been received in 7388 ms
Aug 26 03:50:23 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 50
Aug 26 03:50:23 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 10
Aug 26 03:50:24 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 60
Aug 26 03:50:24 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 20
Aug 26 03:50:25 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 70
Aug 26 03:50:25 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 30
Aug 26 03:50:26 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 80
Aug 26 03:50:26 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 40
Aug 26 03:50:26 pm corosync[7436]:   [TOTEM ] Token has not been received in 11989 ms
Aug 26 03:50:27 pm pmxcfs[7284]: [dcdb] notice: cpg_send_message retry 90
Aug 26 03:50:27 pm pmxcfs[7284]: [status] notice: cpg_send_message retry 50
Aug 26 03:50:27 pm corosync[7436]:   [KNET  ] link: host: 2 link: 0 is down
Aug 26 03:50:27 pm corosync[7436]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 26 03:50:27 pm corosync[7436]:   [KNET  ] host: host: 2 has no active links
Aug 26 03:50:27 pm corosync[7436]:   [QUORUM] Sync members[2]: 1 3

Code:
pm:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.111-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-5
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.111-1-pve: 5.15.111-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Code:
root@pm2:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.111-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-5
pve-kernel-5.15.111-1-pve: 5.15.111-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Code:
root@pm3:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.111-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-5
pve-kernel-5.15.111-1-pve: 5.15.111-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
Happened again this morning. Here is a snip from journalctl before the system became unresponsive. I also attached a txt document containing logs that go a bit further back. Any guidance on what to look at more closely would be appreciated. If there are any other logs or configurations I should share, please let me know. Thank you!!!


Code:
Aug 31 05:40:57 pm ceph-osd[7604]: 2023-08-31T05:40:57.550-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:40:58 pm ceph-osd[7604]: 2023-08-31T05:40:58.598-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:40:59 pm ceph-osd[7604]: 2023-08-31T05:40:59.562-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:00 pm corosync[7437]:   [TOTEM ] Retransmit List: 1 2 3
Aug 31 05:41:00 pm corosync[7437]:   [TOTEM ] Retransmit List: 1 2 3
Aug 31 05:41:00 pm ceph-osd[7604]: 2023-08-31T05:41:00.602-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:01 pm corosync[7437]:   [TOTEM ] Retransmit List: 1 2
Aug 31 05:41:01 pm ceph-osd[7604]: 2023-08-31T05:41:01.558-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:01 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:02 pm ceph-osd[7604]: 2023-08-31T05:41:02.538-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:03 pm ceph-osd[7604]: 2023-08-31T05:41:03.550-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:03 pm corosync[7437]:   [TOTEM ] Token has not been received in 2737 ms
Aug 31 05:41:04 pm ceph-osd[7604]: 2023-08-31T05:41:04.530-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:05 pm ceph-osd[7604]: 2023-08-31T05:41:05.490-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:05 pm corosync[7437]:   [KNET  ] link: host: 2 link: 0 is down
Aug 31 05:41:05 pm corosync[7437]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 31 05:41:05 pm corosync[7437]:   [KNET  ] host: host: 2 has no active links
Aug 31 05:41:06 pm ceph-osd[7604]: 2023-08-31T05:41:06.494-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:06 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:07 pm ceph-osd[7604]: 2023-08-31T05:41:07.482-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:07 pm corosync[7437]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 31 05:41:07 pm corosync[7437]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Aug 31 05:41:07 pm corosync[7437]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 31 05:41:08 pm ceph-osd[7604]: 2023-08-31T05:41:08.482-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:09 pm ceph-osd[7604]: 2023-08-31T05:41:09.526-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:10 pm ceph-osd[7604]: 2023-08-31T05:41:10.494-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 9 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:11 pm ceph-osd[7604]: 2023-08-31T05:41:11.498-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 9 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:11 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:12 pm ceph-osd[7604]: 2023-08-31T05:41:12.542-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 9 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:12 pm corosync[7437]:   [TOTEM ] Token has not been received in 11829 ms
Aug 31 05:41:13 pm ceph-osd[7604]: 2023-08-31T05:41:13.578-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 9 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:14 pm ceph-osd[7604]: 2023-08-31T05:41:14.618-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 9 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:15 pm ceph-osd[7604]: 2023-08-31T05:41:15.662-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 13 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:15 pm corosync[7437]:   [TOTEM ] Token has not been received in 14660 ms
Aug 31 05:41:16 pm ceph-osd[7604]: 2023-08-31T05:41:16.634-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:16 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:16 pm corosync[7437]:   [KNET  ] link: host: 2 link: 0 is down
Aug 31 05:41:16 pm corosync[7437]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 31 05:41:16 pm corosync[7437]:   [KNET  ] host: host: 2 has no active links
Aug 31 05:41:17 pm ceph-osd[7604]: 2023-08-31T05:41:17.650-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:18 pm ceph-osd[7604]: 2023-08-31T05:41:18.698-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:19 pm ceph-osd[7604]: 2023-08-31T05:41:19.658-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:19 pm corosync[7437]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 31 05:41:19 pm corosync[7437]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Aug 31 05:41:19 pm corosync[7437]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 31 05:41:20 pm ceph-osd[7604]: 2023-08-31T05:41:20.686-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:21 pm ceph-osd[7604]: 2023-08-31T05:41:21.670-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:21 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:22 pm ceph-osd[7604]: 2023-08-31T05:41:22.642-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:23 pm corosync[7437]:   [TOTEM ] Token has not been received in 22164 ms
Aug 31 05:41:23 pm ceph-osd[7604]: 2023-08-31T05:41:23.622-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3.e5ece409 (undecoded) ondisk+write+known_if_redirected e2153)
Aug 31 05:41:24 pm ceph-osd[7604]: 2023-08-31T05:41:24.634-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3:902737a7:::rbd_header.61f7662f92266:head [watch ping cookie 18446462598732841013 gen 5] snapc 0=[] ondisk+write+>
Aug 31 05:41:25 pm ceph-osd[7604]: 2023-08-31T05:41:25.646-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3:902737a7:::rbd_header.61f7662f92266:head [watch ping cookie 18446462598732841013 gen 5] snapc 0=[] ondisk+write+>
Aug 31 05:41:26 pm ceph-osd[7604]: 2023-08-31T05:41:26.646-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3:902737a7:::rbd_header.61f7662f92266:head [watch ping cookie 18446462598732841013 gen 5] snapc 0=[] ondisk+write+>
Aug 31 05:41:26 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:27 pm corosync[7437]:   [TOTEM ] Token has not been received in 26511 ms
Aug 31 05:41:27 pm ceph-osd[7604]: 2023-08-31T05:41:27.690-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3:902737a7:::rbd_header.61f7662f92266:head [watch ping cookie 18446462598732841013 gen 5] snapc 0=[] ondisk+write+>
Aug 31 05:41:28 pm ceph-osd[7604]: 2023-08-31T05:41:28.722-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3:902737a7:::rbd_header.61f7662f92266:head [watch ping cookie 18446462598732841013 gen 5] snapc 0=[] ondisk+write+>
Aug 31 05:41:29 pm ceph-osd[7604]: 2023-08-31T05:41:29.714-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3:902737a7:::rbd_header.61f7662f92266:head [watch ping cookie 18446462598732841013 gen 5] snapc 0=[] ondisk+write+>
Aug 31 05:41:30 pm corosync[7437]:   [QUORUM] Sync members[3]: 1 2 3
Aug 31 05:41:30 pm corosync[7437]:   [TOTEM ] A new membership (1.789f) was formed. Members
Aug 31 05:41:30 pm corosync[7437]:   [TOTEM ] Retransmit List: 1 2 3
Aug 31 05:41:30 pm corosync[7437]:   [TOTEM ] Retransmit List: 1 2 3
Aug 31 05:41:30 pm ceph-osd[7604]: 2023-08-31T05:41:30.674-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3:902737a7:::rbd_header.61f7662f92266:head [watch ping cookie 18446462598732841013 gen 5] snapc 0=[] ondisk+write+>
Aug 31 05:41:30 pm corosync[7437]:   [KNET  ] link: host: 2 link: 0 is down
Aug 31 05:41:30 pm corosync[7437]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 31 05:41:30 pm corosync[7437]:   [KNET  ] host: host: 2 has no active links
Aug 31 05:41:31 pm ceph-osd[7604]: 2023-08-31T05:41:31.714-0400 7fa8d9fdc700 -1 osd.2 2155 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.6169981.0:2262955 3.9 3:902737a7:::rbd_header.61f7662f92266:head [watch ping cookie 18446462598732841013 gen 5] snapc 0=[] ondisk+write+>
Aug 31 05:41:31 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:32 pm corosync[7437]:   [TOTEM ] Token has not been received in 2737 ms
Aug 31 05:41:33 pm corosync[7437]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Aug 31 05:41:34 pm corosync[7437]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 31 05:41:34 pm corosync[7437]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Aug 31 05:41:34 pm corosync[7437]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 31 05:41:36 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:40 pm kernel: EXT4-fs (rbd1): 3 orphan inodes deleted
Aug 31 05:41:40 pm kernel: EXT4-fs (rbd1): recovery complete
Aug 31 05:41:40 pm watchdog-mux[6972]: client watchdog expired - disable watchdog updates
Aug 31 05:41:40 pm kernel: EXT4-fs (rbd1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Aug 31 05:41:41 pm pve-ha-lrm[548963]: Task 'UPID:pm:00086064:003907AA:64F05FD8:vzstart:111:root@pam:' still active, waiting
Aug 31 05:41:41 pm corosync[7437]:   [KNET  ] link: host: 2 link: 0 is down
Aug 31 05:41:41 pm corosync[7437]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 31 05:41:41 pm corosync[7437]:   [KNET  ] host: host: 2 has no active links
Aug 31 05:41:41 pm corosync[7437]:   [TOTEM ] Token has not been received in 2737 ms
Aug 31 05:41:43 pm corosync[7437]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 31 05:41:43 pm corosync[7437]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Aug 31 05:41:43 pm corosync[7437]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 31 05:41:44 pm audit[549313]: AVC apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-111_</var/lib/lxc>" pid=549313 comm="apparmor_parser"
Aug 31 05:41:44 pm kernel: kauditd_printk_skb: 4 callbacks suppressed
Aug 31 05:41:44 pm kernel: audit: type=1400 audit(1693474904.022:36): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-111_</var/lib/lxc>" pid=549313 comm="apparmor_parser"
Aug 31 05:41:44 pm systemd-udevd[549317]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 31 05:41:44 pm systemd-udevd[549317]: Using default interface naming scheme 'v247'.
Aug 31 05:41:44 pm corosync[7437]:   [KNET  ] pmtud: Global data MTU changed to: 1397
-- Boot 372c2269f92144edbbbfbeb931e60a5b --
 


Hi,
are the Ceph network and the Corosync network the same? It's recommended to use a dedicated network for Corosync: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network
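For illustration only, a dedicated link is usually added as an extra ring in /etc/pve/corosync.conf, roughly along these lines (just a sketch; the addresses are placeholders, and config_version in the totem section must be incremented whenever you edit the file):

Code:
# /etc/pve/corosync.conf (excerpt) -- sketch only, addresses are placeholders
nodelist {
  node {
    name: pm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11    # existing shared network
    ring1_addr: 10.10.10.11     # new, dedicated Corosync network
  }
  # ... add a ring1_addr to the pm2 and pm3 entries as well ...
}

totem {
  # ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
    knet_link_priority: 20    # higher value = preferred link
  }
}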

Although I read the documentation and understood the significance of keeping them on separate networks, I opted for a single network due to budget constraints. I acknowledge that this is not the recommended configuration and may lead to issues. In my homelab environment, I'm primarily experimenting and learning, and I plan to separate them into distinct networks in the coming weeks. As I mentioned in my initial post, I encountered a situation where one of the nodes became completely unresponsive and a power-fail was required to bring it back online. Is this expected behavior in such a configuration when latency may exceed 5 ms?

Thank you for your time and help!!!!
 
Is this expected behavior in such a configuration when latency may exceed 5 ms?
I don't think 5 ms should be problematic, but are you sure there are no bigger latency spikes? The error messages indicate longer delays:
Aug 26 03:50:17 pm corosync[7436]: [TOTEM ] Token has not been received in 2737 ms
The more nodes you have, the more relevant it becomes.

EDIT: To be a bit more precise, up to 10 ms latency should be fine, but anything more might become problematic. As a simple check, you can use ping to see whether you have any latency spikes.
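For example, something like this (a sketch; the IP is a placeholder for node 2's Corosync address) shows packet loss and min/avg/max RTT, and corosync-cfgtool prints the current knet link status:

Code:
# 500 probes at 200 ms intervals against node 2's corosync address (placeholder IP)
ping -i 0.2 -c 500 192.168.1.12 | tail -n 3
# current knet ring/link status as seen by this node
corosync-cfgtool -s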
 
I may have found the cause of the spikes:


[Attachment: Capture-min.PNG]

I'll be rerunning that with conduit and a second pull for a separate network. Keep ya posted. Thanks again!!!
 
Unshielded twisted pair for the cluster interconnect?
I know. Much shame.

It was initially a temporary solution for testing, but it ended up lasting longer than anticipated. The good news is that I've relocated the node that was on that shamefully decrepit connection, and my link issues have cleared up. I'm currently in the process of planning out the second network as recommended in the documentation.
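A quick sanity check that the flapping has really stopped is to grep the corosync journal for link and token messages (simple sketch; adjust the time window as needed):

Code:
# count remaining link-down / token-timeout events over the last 24 hours
journalctl -u corosync --since "24 hours ago" | grep -Eic 'link: host|token has not been received'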
 
