Why node PVE1 rebooted in a 3-node HA cluster

tyjtyj

PVE1 was running with pvecm e 1 (expected votes set to 1).
pve2 and pve3 were shut down and later brought up together.

How do I check the logs to confirm PVE1 reloaded due to losing quorum?

I brought up pve2 and pve3:
15:57:15 -- pve2 started
15:57:24 -- pve3 started
16:09:07 -- pve1's watchdog expired and it reloaded (losing 3 HA VMs)
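
One way to pull the relevant journal window on pve1 is sketched below. This is only a sketch: the year is taken from the kernel line later in the log, and the unit list is simply the standard PVE cluster/HA services.

Code:
# Sketch: cluster/HA related journal entries around the incident window
journalctl --since "2023-09-01 15:55:00" --until "2023-09-01 16:10:00" \
    -u corosync -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux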

PVE1
Aug 31 15:50:01 pve1 CRON[2437022]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 31 15:50:01 pve1 CRON[2437023]: (root) CMD (/root/script/mqtt_push.sh >/dev/null 2>&1)
Aug 31 15:50:07 pve1 CRON[2437022]: pam_unix(cron:session): session closed for user root
Aug 31 15:53:44 pve1 pvedaemon[2257684]: <root@pam> successful auth for user 'root@pam'
Aug 31 15:55:01 pve1 CRON[2445387]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 31 15:55:01 pve1 CRON[2445388]: (root) CMD (/root/script/mqtt_push.sh >/dev/null 2>&1)
Aug 31 15:55:07 pve1 CRON[2445387]: pam_unix(cron:session): session closed for user root
Aug 31 15:58:47 pve1 mergerfs[799]: running basic garbage collection
Aug 31 15:59:19 pve1 pmxcfs[2922]: [dcdb] notice: data verification successful
Aug 31 15:59:42 pve1 mergerfs[711303]: running basic garbage collection
Sep 01 15:50:01 pve1 CRON[861107]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 01 15:50:01 pve1 CRON[861108]: (root) CMD (/root/script/mqtt_push.sh >/dev/null 2>&1)
Sep 01 15:50:07 pve1 CRON[861107]: pam_unix(cron:session): session closed for user root
Sep 01 15:55:01 pve1 CRON[869356]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 01 15:55:01 pve1 CRON[869357]: (root) CMD (/root/script/mqtt_push.sh >/dev/null 2>&1)
Sep 01 15:55:07 pve1 CRON[869356]: pam_unix(cron:session): session closed for user root
Sep 01 15:56:09 pve1 pvedaemon[3077783]: <root@pam> successful auth for user 'root@pam'
Sep 01 15:57:15 pve1 corosync[2987]: [KNET ] rx: host: 2 link: 0 is up
Sep 01 15:57:15 pve1 corosync[2987]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Sep 01 15:57:15 pve1 corosync[2987]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 01 15:57:15 pve1 corosync[2987]: [KNET ] pmtud: Global data MTU changed to: 1397
Sep 01 15:57:15 pve1 corosync[2987]: [QUORUM] Sync members[2]: 1 2
Sep 01 15:57:15 pve1 corosync[2987]: [QUORUM] Sync joined[1]: 2
Sep 01 15:57:15 pve1 corosync[2987]: [TOTEM ] A new membership (1.10f5) was formed. Members joined: 2
Sep 01 15:57:15 pve1 corosync[2987]: [QUORUM] Members[2]: 1 2
Sep 01 15:57:15 pve1 corosync[2987]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: members: 1/2922, 2/1605
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: starting data syncronisation
Sep 01 15:57:16 pve1 pmxcfs[2922]: [status] notice: members: 1/2922, 2/1605
Sep 01 15:57:16 pve1 pmxcfs[2922]: [status] notice: starting data syncronisation
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: received sync request (epoch 1/2922/00000005)
Sep 01 15:57:16 pve1 pmxcfs[2922]: [status] notice: received sync request (epoch 1/2922/00000005)
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: received all states
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: leader is 1/2922
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: synced members: 1/2922
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: start sending inode updates
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: sent all (6) updates
Sep 01 15:57:16 pve1 pmxcfs[2922]: [dcdb] notice: all data is up to date
Sep 01 15:57:16 pve1 pmxcfs[2922]: [status] notice: received all states
Sep 01 15:57:16 pve1 pmxcfs[2922]: [status] notice: all data is up to date
Sep 01 15:57:20 pve1 pmxcfs[2922]: [status] notice: received log
Sep 01 15:57:20 pve1 pmxcfs[2922]: [status] notice: received log
Sep 01 15:57:24 pve1 corosync[2987]: [KNET ] rx: host: 3 link: 0 is up
Sep 01 15:57:24 pve1 corosync[2987]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Sep 01 15:57:24 pve1 corosync[2987]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 01 15:57:24 pve1 corosync[2987]: [QUORUM] Sync members[3]: 1 2 3
Sep 01 15:57:24 pve1 corosync[2987]: [QUORUM] Sync joined[1]: 3
Sep 01 15:57:24 pve1 corosync[2987]: [TOTEM ] A new membership (1.10f9) was formed. Members joined: 3
Sep 01 15:57:24 pve1 corosync[2987]: [QUORUM] Members[3]: 1 2 3
Sep 01 15:57:24 pve1 corosync[2987]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 01 15:57:24 pve1 corosync[2987]: [KNET ] pmtud: Global data MTU changed to: 1397
Sep 01 15:57:26 pve1 pve-ha-crm[3141]: node 'pve2': state changed from 'maintenance' => 'online'
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: members: 1/2922, 2/1605, 3/2197
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: starting data syncronisation
Sep 01 15:57:26 pve1 pmxcfs[2922]: [status] notice: members: 1/2922, 2/1605, 3/2197
Sep 01 15:57:26 pve1 pmxcfs[2922]: [status] notice: starting data syncronisation
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: received sync request (epoch 1/2922/00000006)
Sep 01 15:57:26 pve1 pmxcfs[2922]: [status] notice: received sync request (epoch 1/2922/00000006)
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: received all states
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: leader is 1/2922
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: synced members: 1/2922, 2/1605
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: start sending inode updates
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: sent all (20) updates
Sep 01 15:57:26 pve1 pmxcfs[2922]: [dcdb] notice: all data is up to date
Sep 01 15:57:26 pve1 pmxcfs[2922]: [status] notice: received all states
Sep 01 15:57:26 pve1 pmxcfs[2922]: [status] notice: all data is up to date
Sep 01 15:57:30 pve1 pmxcfs[2922]: [status] notice: received log
Sep 01 15:57:30 pve1 pmxcfs[2922]: [status] notice: received log
Sep 01 15:57:31 pve1 pveproxy[438364]: Clearing outdated entries from certificate cache
Sep 01 15:57:36 pve1 pve-ha-crm[3141]: node 'pve3': state changed from 'maintenance' => 'online'
Sep 01 15:58:48 pve1 mergerfs[799]: running basic garbage collection
Sep 01 15:59:14 pve1 pveproxy[435876]: Clearing outdated entries from certificate cache
Sep 01 15:59:19 pve1 pmxcfs[2922]: [dcdb] notice: data verification successful
Sep 01 15:59:22 pve1 pveproxy[454530]: Clearing outdated entries from certificate cache
Sep 01 15:59:44 pve1 mergerfs[711303]: running basic garbage collection
Sep 01 16:00:01 pve1 CRON[902608]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 01 16:00:01 pve1 CRON[902609]: (root) CMD (/root/script/mqtt_push.sh >/dev/null 2>&1)
Sep 01 16:00:07 pve1 CRON[902608]: pam_unix(cron:session): session closed for user root
Sep 01 16:01:01 pve1 sshd[904977]: Accepted password for root from 192.168.11.237 port 59857 ssh2
Sep 01 16:01:01 pve1 sshd[904977]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Sep 01 16:01:01 pve1 systemd-logind[2324]: New session 3480 of user root.
Sep 01 16:01:01 pve1 systemd[1]: Started session-3480.scope - Session 3480 of User root.
Sep 01 16:01:01 pve1 sshd[904977]: pam_env(sshd:session): deprecated reading of user environment enabled
Sep 01 16:01:02 pve1 pveproxy[435876]: worker exit
Sep 01 16:01:02 pve1 pveproxy[3145]: worker 435876 finished
Sep 01 16:01:02 pve1 pveproxy[3145]: starting 1 worker(s)
Sep 01 16:01:02 pve1 pveproxy[3145]: worker 905104 started
Sep 01 16:01:45 pve1 pveproxy[438364]: worker exit
Sep 01 16:01:45 pve1 pveproxy[3145]: worker 438364 finished
Sep 01 16:01:45 pve1 pveproxy[3145]: starting 1 worker(s)
Sep 01 16:01:45 pve1 pveproxy[3145]: worker 906566 started
Sep 01 16:02:45 pve1 pveproxy[906566]: Clearing outdated entries from certificate cache
Sep 01 16:05:21 pve1 pveproxy[905104]: Clearing outdated entries from certificate cache
Sep 01 16:09:07 pve1 watchdog-mux[2327]: client watchdog expired - disable watchdog updates
Sep 01 16:09:09 pve1 watchdog-mux[2327]: exit watchdog-mux with active connections
Sep 01 16:09:09 pve1 systemd[1]: watchdog-mux.service: Deactivated successfully.
Sep 01 16:09:10 pve1 systemd-journald[590]: Received client request to sync journal.
Sep 01 16:09:10 pve1 kernel: watchdog: watchdog0: watchdog did not stop!
Sep 01 16:09:58 pve1 kernel: Linux version 6.2.16-8-pve (fgruenbichler@yuna) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-8 (2023-08-02T12:17Z) ()
Sep 01 16:09:58 pve1 kernel: Command line: initrd=\EFI\proxmox\6.2.16-8-pve\initrd.img-6.2.16-8-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on reboot=efi
 
When a node does not have quorum, it will fence (shut down) all HA guests on it so they can be recovered on the part of the cluster that has quorum. Of course, if no part has quorum, the latter won't happen. It has to be emphasized that if the node does not fence itself and the guest is started on another node, you end up in a situation with two guests using the same storage, leading to potential data loss.

You can find more info at [1].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
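
As a hedged sketch of how to confirm a fence after the fact (ha-manager and the unit names are the standard PVE tools; the grep pattern is only an assumption about the message wording):

Code:
# Current HA resource placement after recovery, run on a quorate node
ha-manager status
# CRM log on the current master, filtered for fencing-related messages
journalctl -u pve-ha-crm | grep -i fence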
 
Hi.
Thank you for your reply.

Unfortunately in this case, all 3 nodes were running and pve1 suddenly reloaded. Is there a way I can check the logs to confirm it reloaded due to fencing?

If you look at the log, all 3 nodes are there:
Sep 01 15:57:24 pve1 corosync[2987]: [QUORUM] Sync joined[1]: 3
Nothing more appears about it after that.

These are the last messages before it reboots:
Sep 01 16:09:07 pve1 watchdog-mux[2327]: client watchdog expired - disable watchdog updates
Sep 01 16:09:09 pve1 watchdog-mux[2327]: exit watchdog-mux with active connections
Sep 01 16:09:09 pve1 systemd[1]: watchdog-mux.service: Deactivated successfully.
Sep 01 16:09:10 pve1 systemd-journald[590]: Received client request to sync journal.
Sep 01 16:09:10 pve1 kernel: watchdog: watchdog0: watchdog did not stop!
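
A sketch of how this can be confirmed from pve1 itself, assuming journald keeps logs across boots (the PVE default):

Code:
# List boots, then read the tail of the previous one for the HA/watchdog units
journalctl --list-boots
journalctl -b -1 -e -u watchdog-mux -u pve-ha-lrm -u corosync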
 
Hello @tyjtyj, check the logs with journalctl -u corosync. Please also post /etc/pve/corosync.conf, /etc/network/interfaces, and /etc/hosts.
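
For reference, the requested commands as a copy-pasteable sketch (the -b -1 is an addition here, to read the corosync log from the boot before the reset):

Code:
journalctl -u corosync -b -1
cat /etc/pve/corosync.conf
cat /etc/network/interfaces
cat /etc/hosts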
 
Pretty much the watchdog is the root cause. I have had this issue multiple times; it is all random. When the other nodes returned from power-off, pve1 reloaded.
I also checked that replication had all completed before Sep 01 15:59:59.
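
For what it's worth, a sketch of how replication completion can be checked with the built-in tool (output columns may differ between versions):

Code:
# List configured replication jobs and show their last sync time and state
pvesr list
pvesr status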

Sep 01 15:57:15 pve1 corosync[2987]: [KNET ] rx: host: 2 link: 0 is up
Sep 01 15:57:15 pve1 corosync[2987]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Sep 01 15:57:15 pve1 corosync[2987]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 01 15:57:15 pve1 corosync[2987]: [KNET ] pmtud: Global data MTU changed to: 1397
Sep 01 15:57:15 pve1 corosync[2987]: [QUORUM] Sync members[2]: 1 2
Sep 01 15:57:15 pve1 corosync[2987]: [QUORUM] Sync joined[1]: 2
Sep 01 15:57:15 pve1 corosync[2987]: [TOTEM ] A new membership (1.10f5) was formed. Members joined: 2
Sep 01 15:57:15 pve1 corosync[2987]: [QUORUM] Members[2]: 1 2
Sep 01 15:57:15 pve1 corosync[2987]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 01 15:57:24 pve1 corosync[2987]: [KNET ] rx: host: 3 link: 0 is up
Sep 01 15:57:24 pve1 corosync[2987]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Sep 01 15:57:24 pve1 corosync[2987]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 01 15:57:24 pve1 corosync[2987]: [QUORUM] Sync members[3]: 1 2 3
Sep 01 15:57:24 pve1 corosync[2987]: [QUORUM] Sync joined[1]: 3
Sep 01 15:57:24 pve1 corosync[2987]: [TOTEM ] A new membership (1.10f9) was formed. Members joined: 3
Sep 01 15:57:24 pve1 corosync[2987]: [QUORUM] Members[3]: 1 2 3
Sep 01 15:57:24 pve1 corosync[2987]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 01 15:57:24 pve1 corosync[2987]: [KNET ] pmtud: Global data MTU changed to: 1397
-- Boot a7f0d90a95a541b48e6324be0d2e493a --
Sep 01 16:10:46 pve1 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Sep 01 16:10:46 pve1 corosync[3718]: [MAIN ] Corosync Cluster Engine starting up
Sep 01 16:10:46 pve1 corosync[3718]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Sep 01 16:10:46 pve1 corosync[3718]: [TOTEM ] Initializing transport (Kronosnet).
Sep 01 16:10:47 pve1 corosync[3718]: [TOTEM ] totemknet initialized
Sep 01 16:10:47 pve1 corosync[3718]: [KNET ] pmtud: MTU manually set to: 0
Sep 01 16:10:47 pve1 corosync[3718]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Sep 01 16:10:47 pve1 corosync[3718]: [SERV ] Service engine loaded: corosync configuration map access [0]
Sep 01 16:10:47 pve1 corosync[3718]: [QB ] server name: cmap
Sep 01 16:10:47 pve1 corosync[3718]: [SERV ] Service engine loaded: corosync configuration service [1]


cat /etc/pve/corosync.conf

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.41
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.42
  }
  node {
    name: pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.43
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 21
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}


pve1:/var/log# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.1.41 pve1.removed.com pve1
192.168.1.42 pve2.removed.com pve2
192.168.1.43 pve3.removed.com pve3
192.168.1.117 pbs1.removed.com pbs1
 
Now I am pretty sure it is the watchdog. Basically all previous events had the same root cause:

Code:
root@pve1:/var/log# journalctl -u watchdog-mux | grep expired
Dec 04 06:10:31 pve1 watchdog-mux[1827]: client watchdog expired - disable watchdog updates
Dec 27 05:50:50 pve1 watchdog-mux[1824]: client watchdog expired - disable watchdog updates
Jul 24 12:21:59 pve1 watchdog-mux[2202]: client watchdog expired - disable watchdog updates
Jul 27 22:29:08 pve1 watchdog-mux[2278]: client watchdog expired - disable watchdog updates
Aug 11 16:34:02 pve1 watchdog-mux[2126]: client watchdog expired - disable watchdog updates
Sep 01 16:09:07 pve1 watchdog-mux[2327]: client watchdog expired - disable watchdog updates


It seems this does not affect quorum, because I have been tracking quorum every 10s since the last issue.
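
A minimal sketch of the kind of 10-second tracking loop implied here (hypothetical script; pvecm status is the real command, the log path and field parsing are assumptions):

Code:
#!/bin/bash
# Hypothetical quorum tracker: append a timestamped quorate flag every 10 s
# so a later reboot can be correlated with quorum state.
LOG=/var/log/quorum-track.log   # assumed path
while true; do
    q=$(pvecm status 2>/dev/null | awk '/^Quorate:/ {print $2}')
    echo "$(date -Is) quorate=${q:-unknown}" >> "$LOG"
    sleep 10
done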
 
