Since we were already talking about reboots, here's a fresh one, from an upgrade of 4.4-5 to 4.4-13. The reboot is at the end of the log.
Apr 12 14:55:47 srv-01-szd systemd[1]: Stopped Corosync Cluster Engine.
Apr 12 14:55:47 srv-01-szd systemd[1]: Starting Corosync Cluster Engine...
Apr 12 14:55:47 srv-01-szd corosync[26358]: [MAIN ] Corosync Cluster Engine ('2.4.2'): started and ready to provide service.
Apr 12 14:55:47 srv-01-szd corosync[26358]: [MAIN ] Corosync built-in features: augeas systemd pie relro bindnow
Apr 12 14:55:47 srv-01-szd corosync[26359]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Apr 12 14:55:47 srv-01-szd corosync[26359]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Apr 12 14:55:47 srv-01-szd corosync[26359]: [TOTEM ] The network interface [10.63.1.211] is now up.
Apr 12 14:55:47 srv-01-szd corosync[26359]: [SERV ] Service engine loaded: corosync configuration map access [0]
Apr 12 14:55:47 srv-01-szd corosync[26359]: [QB ] server name: cmap
Apr 12 14:55:47 srv-01-szd corosync[26359]: [SERV ] Service engine loaded: corosync configuration service [1]
Apr 12 14:55:47 srv-01-szd corosync[26359]: [QB ] server name: cfg
Apr 12 14:55:47 srv-01-szd corosync[26359]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Apr 12 14:55:47 srv-01-szd corosync[26359]: [QB ] server name: cpg
Apr 12 14:55:47 srv-01-szd corosync[26359]: [SERV ] Service engine loaded: corosync profile loading service [4]
Apr 12 14:55:47 srv-01-szd corosync[26359]: [QUORUM] Using quorum provider corosync_votequorum
Apr 12 14:55:47 srv-01-szd corosync[26359]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Apr 12 14:55:47 srv-01-szd corosync[26359]: [QB ] server name: votequorum
Apr 12 14:55:47 srv-01-szd corosync[26359]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Apr 12 14:55:47 srv-01-szd corosync[26359]: [QB ] server name: quorum
Apr 12 14:55:48 srv-01-szd corosync[26359]: [TOTEM ] A new membership (10.63.1.211:308) was formed. Members joined: 1 2 3 4
Apr 12 14:55:48 srv-01-szd corosync[26359]: [QUORUM] This node is within the primary component and will provide service.
Apr 12 14:55:48 srv-01-szd corosync[26359]: [QUORUM] Members[4]: 1 2 3 4
Apr 12 14:55:48 srv-01-szd corosync[26359]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 12 14:55:48 srv-01-szd corosync[26352]: Starting Corosync Cluster Engine (corosync): [ OK ]
Apr 12 14:55:48 srv-01-szd systemd[1]: Started Corosync Cluster Engine.
Apr 12 14:55:48 srv-01-szd systemd[1]: Reloading PVE API Daemon.
Apr 12 14:55:50 srv-01-szd pvedaemon[26381]: send HUP to 3437
Apr 12 14:55:50 srv-01-szd pvedaemon[3437]: received signal HUP
Apr 12 14:55:50 srv-01-szd pvedaemon[3437]: server closing
Apr 12 14:55:50 srv-01-szd pvedaemon[3437]: server shutdown (restart)
Apr 12 14:55:50 srv-01-szd pvedaemon[15325]: worker exit
Apr 12 14:55:50 srv-01-szd pvedaemon[15327]: worker exit
Apr 12 14:55:50 srv-01-szd systemd[1]: Reloaded PVE API Daemon.
Apr 12 14:55:50 srv-01-szd systemd[1]: Reloading PVE Status Daemon.
Apr 12 14:55:50 srv-01-szd pvestatd[26389]: send HUP to 3412
Apr 12 14:55:50 srv-01-szd pvestatd[3412]: received signal HUP
Apr 12 14:55:50 srv-01-szd pvestatd[3412]: server shutdown (restart)
Apr 12 14:55:50 srv-01-szd systemd[1]: Reloaded PVE Status Daemon.
Apr 12 14:55:50 srv-01-szd systemd[1]: Reloading PVE API Proxy Server.
Apr 12 14:55:51 srv-01-szd pvedaemon[26385]: worker exit
Apr 12 14:55:51 srv-01-szd pvedaemon[3437]: restarting server
Apr 12 14:55:51 srv-01-szd pvedaemon[3437]: worker 15326 finished
Apr 12 14:55:51 srv-01-szd pvedaemon[3437]: worker 15327 finished
Apr 12 14:55:51 srv-01-szd pvedaemon[3437]: worker 15325 finished
Apr 12 14:55:51 srv-01-szd pvedaemon[3437]: starting 3 worker(s)
Apr 12 14:55:51 srv-01-szd pvedaemon[3437]: worker 26398 started
Apr 12 14:55:51 srv-01-szd pvedaemon[3437]: worker 26399 started
Apr 12 14:55:51 srv-01-szd pvedaemon[3437]: worker 26400 started
Apr 12 14:55:51 srv-01-szd watchdog-mux[1605]: client watchdog expired - disable watchdog updates
Apr 12 14:55:51 srv-01-szd pvestatd[3412]: restarting server
Apr 12 14:55:51 srv-01-szd pveproxy[26394]: send HUP to 40759
Apr 12 14:55:51 srv-01-szd pveproxy[40759]: received signal HUP
Apr 12 14:55:51 srv-01-szd pveproxy[40759]: server closing
Apr 12 14:55:51 srv-01-szd pveproxy[40759]: server shutdown (restart)
Apr 12 14:55:51 srv-01-szd pveproxy[15355]: worker exit
Apr 12 14:55:51 srv-01-szd pveproxy[15353]: worker exit
Apr 12 14:55:51 srv-01-szd systemd[1]: Reloaded PVE API Proxy Server.
Apr 12 14:55:51 srv-01-szd systemd[1]: Reloading PVE SPICE Proxy Server.
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: Using '/etc/pve/local/pveproxy-ssl.pem' as certificate for the web interface.
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: restarting server
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: worker 15355 finished
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: worker 15353 finished
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: worker 15354 finished
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: starting 3 worker(s)
Apr 12 14:55:52 srv-01-szd spiceproxy[26409]: send HUP to 40788
Apr 12 14:55:52 srv-01-szd spiceproxy[40788]: received signal HUP
Apr 12 14:55:52 srv-01-szd spiceproxy[40788]: server closing
Apr 12 14:55:52 srv-01-szd spiceproxy[40788]: server shutdown (restart)
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: worker 26418 started
Apr 12 14:55:52 srv-01-szd spiceproxy[16296]: worker exit
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: worker 26419 started
Apr 12 14:55:52 srv-01-szd pveproxy[40759]: worker 26420 started
Apr 12 14:55:52 srv-01-szd systemd[1]: Reloaded PVE SPICE Proxy Server.
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [status] notice: update cluster info (cluster name nmscluster, version = 4)
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: start cluster connection
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [status] notice: start cluster connection
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [status] notice: node has quorum
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: members: 1/6099, 2/3032, 3/25030, 4/3147
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: starting data syncronisation
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: received sync request (epoch 1/6099/00000003)
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [status] notice: members: 1/6099, 2/3032, 3/25030, 4/3147
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [status] notice: starting data syncronisation
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [status] notice: received sync request (epoch 1/6099/00000003)
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: received all states
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: leader is 2/3032
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: synced members: 2/3032, 3/25030, 4/3147
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: waiting for updates from leader
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [status] notice: received all states
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [status] notice: all data is up to date
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: update complete - trying to commit (got 10 inode updates)
Apr 12 14:55:53 srv-01-szd pmxcfs[6099]: [dcdb] notice: all data is up to date
Apr 12 14:55:53 srv-01-szd spiceproxy[40788]: restarting server
Apr 12 14:55:53 srv-01-szd spiceproxy[40788]: worker 16296 finished
Apr 12 14:55:53 srv-01-szd spiceproxy[40788]: starting 1 worker(s)
Apr 12 14:55:53 srv-01-szd spiceproxy[40788]: worker 26962 started
Apr 12 14:55:53 srv-01-szd pveproxy[26405]: worker exit
Apr 12 14:55:55 srv-01-szd pve-ha-lrm[7431]: successfully acquired lock 'ha_agent_srv-01-szd_lock'
Apr 12 14:55:55 srv-01-szd pve-ha-lrm[7431]: status change lost_agent_lock => active
Apr 12 14:55:55 srv-01-szd watchdog-mux[1605]: exit watchdog-mux with active connections
Apr 12 14:55:55 srv-01-szd kernel: [94765.098998] watchdog watchdog0: watchdog did not stop!
Apr 12 14:56:01 srv-01-szd cron[3043]: (*system*zfsutils-linux) RELOAD (/etc/cron.d/zfsutils-linux)
Apr 12 14:56:01 srv-01-szd pve-ha-crm[7657]: status change wait_for_quorum => slave
REBOOT
This is clearly not optimal.
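The tell-tale pair seems to be the `watchdog-mux` line at 14:55:51 ("client watchdog expired - disable watchdog updates") followed by the kernel's "watchdog did not stop!" at 14:55:55: it looks like the HA stack lost its watchdog connection while corosync and the PVE daemons were being restarted mid-upgrade, so the hardware watchdog fired and fenced the node. A quick, hypothetical helper (the function name and log file are just examples) to spot that signature in a saved journal excerpt:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a saved syslog excerpt shows the
# self-fence signature above (a watchdog client expired, and the kernel
# then complained that the watchdog never stopped on shutdown).
was_watchdog_fenced() {
    grep -q 'client watchdog expired' "$1" &&
    grep -q 'watchdog did not stop' "$1"
}
```

Usage on the previous boot's journal, for example: `journalctl -b -1 > node.log && was_watchdog_fenced node.log && echo "node was likely fenced"`.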
As a side note, we upgrade the system in three steps to keep the cluster from falling apart, but it seems even more granularity is needed... *sigh*
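One more granular approach might be to release the watchdog before touching the packages at all, so a corosync/pve-cluster restart during the upgrade cannot expire a watchdog client. This is only a sketch under that assumption, not a tested procedure; the `run`/`upgrade_node` helpers and the `DRY_RUN` switch are mine, not part of any PVE tooling:

```shell
#!/bin/sh
# Sketch: take this node's HA services down first (releasing its watchdog
# connection), upgrade, then bring them back. Assumes HA resources have
# been migrated away from this node beforehand.
# With DRY_RUN=1 the commands are only printed, not executed.
run() { [ "${DRY_RUN:-0}" = 1 ] && echo "+ $*" || "$@"; }

upgrade_node() {
    run systemctl stop pve-ha-lrm    # releases this node's watchdog client
    run systemctl stop pve-ha-crm
    run apt-get update
    run apt-get -y dist-upgrade
    run systemctl start pve-ha-crm
    run systemctl start pve-ha-lrm
}
```

Stopping `pve-ha-lrm` before `pve-ha-crm` matches the order usually recommended for taking the HA stack offline, but whether this avoids the fence on 4.4 is an assumption on my part.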