Hello guys,
Is it possible to increase the synchronization timeout between softdog and corosync?
After a server that was down comes back up, it seems softdog cannot re-synchronize with corosync in time, and it reboots the server. I need some help getting rid of these reboots. Here is the relevant log from the surviving node:
Jul 20 09:17:01 pve CRON[7496]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 20 09:17:01 pve CRON[7495]: pam_unix(cron:session): session closed for user root
Jul 20 09:17:54 pve kernel: EXT4-fs error (device loop0): ext4_lookup:1853: inode #394542: comm systemd-journal: deleted inode referenced: 394776
Jul 20 09:17:55 pve kernel: EXT4-fs error (device loop0): ext4_lookup:1853: inode #394585: comm rm: deleted inode referenced: 394780
Jul 20 09:17:55 pve kernel: EXT4-fs error (device loop0): ext4_lookup:1853: inode #394585: comm rm: deleted inode referenced: 394780
Jul 20 09:17:55 pve kernel: EXT4-fs error (device loop0): ext4_lookup:1853: inode #394585: comm configure-insta: deleted inode referenced: 394780
Jul 20 09:17:55 pve kernel: EXT4-fs error (device loop0): ext4_lookup:1853: inode #394585: comm find: deleted inode referenced: 394780
Jul 20 09:17:55 pve kernel: EXT4-fs error (device loop0): ext4_lookup:1853: inode #394585: comm cmp: deleted inode referenced: 394780
Jul 20 09:18:00 pve kernel: EXT4-fs error (device loop0): mb_free_blocks:1776: group 127, block 4170764:freeing already freed block (bit 9228); block bitmap corrupt.
Jul 20 09:18:25 pve kernel: EXT4-fs error (device loop0): ext4_lookup:1853: inode #394542: comm systemd-journal: deleted inode referenced: 394776
Jul 20 09:20:59 pve kernel: EXT4-fs (loop0): error count since last fsck: 91
Jul 20 09:20:59 pve kernel: EXT4-fs (loop0): initial error at time 1689612598: ext4_validate_inode_bitmap:105
Jul 20 09:20:59 pve kernel: EXT4-fs (loop0): last error at time 1689855505: ext4_lookup:1853: inode 394542
Jul 20 09:26:17 pve corosync[2816]: [KNET ] rx: host: 2 link: 0 is up
Jul 20 09:26:17 pve corosync[2816]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jul 20 09:26:17 pve corosync[2816]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 20 09:26:17 pve corosync[2816]: [QUORUM] Sync members[2]: 1 2
Jul 20 09:26:17 pve corosync[2816]: [QUORUM] Sync joined[1]: 2
Jul 20 09:26:17 pve corosync[2816]: [TOTEM ] A new membership (1.e8) was formed. Members joined: 2
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: members: 1/2732, 2/2743
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: starting data syncronisation
Jul 20 09:26:17 pve corosync[2816]: [QUORUM] Members[2]: 1 2
Jul 20 09:26:17 pve corosync[2816]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 20 09:26:17 pve corosync[2816]: [KNET ] pmtud: Global data MTU changed to: 1397
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: cpg_send_message retried 1 times
Jul 20 09:26:17 pve pmxcfs[2732]: [status] notice: members: 1/2732, 2/2743
Jul 20 09:26:17 pve pmxcfs[2732]: [status] notice: starting data syncronisation
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: received sync request (epoch 1/2732/00000003)
Jul 20 09:26:17 pve pmxcfs[2732]: [status] notice: received sync request (epoch 1/2732/00000003)
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: received all states
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: leader is 2/2743
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: synced members: 2/2743
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: waiting for updates from leader
Jul 20 09:26:17 pve pmxcfs[2732]: [status] notice: received all states
Jul 20 09:26:17 pve pmxcfs[2732]: [status] notice: all data is up to date
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: update complete - trying to commit (got 9 inode updates)
Jul 20 09:26:17 pve pmxcfs[2732]: [dcdb] notice: all data is up to date
Jul 20 09:26:20 pve pve-ha-lrm[2890]: lost lock 'ha_agent_pve_lock - cfs lock update failed - Permission denied
Jul 20 09:26:24 pve pve-ha-crm[2881]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Jul 20 09:26:25 pve pve-ha-lrm[2890]: status change active => lost_agent_lock
Jul 20 09:26:29 pve pve-ha-crm[2881]: status change master => lost_manager_lock
Jul 20 09:26:29 pve pve-ha-crm[2881]: watchdog closed (disabled)
Jul 20 09:26:29 pve pve-ha-crm[2881]: status change lost_manager_lock => wait_for_quorum
Jul 20 09:26:34 pve pve-ha-crm[2881]: status change wait_for_quorum => slave
Jul 20 09:27:11 pve watchdog-mux[2361]: client watchdog expired - disable watchdog updates
-- Reboot --
Jul 20 09:32:56 pve kernel: Linux version 6.2.16-3-pve (tom@sbuild) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) ()
Jul 20 09:32:56 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.2.16-3-pve root=/dev/mapper/pve-root ro quiet
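One thing I have considered trying (untested, just an assumption on my part) is raising the softdog kernel module's `soft_margin` parameter, which sets how many seconds the watchdog waits before rebooting the machine. I am not sure whether pve-ha-manager / watchdog-mux overrides this value, so please correct me if this does nothing:

```
# Untested idea: softdog takes a soft_margin parameter (timeout in
# seconds before it fires). Raising it might give corosync/pmxcfs
# more time to finish syncing after the other node rejoins.
#
# /etc/modprobe.d/softdog.conf  (file name is my choice, not a PVE default)
options softdog soft_margin=120
```

After that, I assume the module would need to be reloaded (or the node rebooted) for the new margin to take effect. Is this the right knob, or is there a supported way to tune this in Proxmox HA?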