Proxmox cluster lost synchronization

azhv

Hello,
Today our cluster lost synchronization. Most of the nodes were shown as offline or unknown. The nodes were up, but every node could see only itself and a few other nodes.
Restarting pve-cluster and corosync didn't help, so we brought everything down and started the nodes one by one.
For most of the hosts this worked. However, when we start the last few of them, they seem to cause the other nodes to crash.
The log says that synchronization has started, and a few minutes later everything crashes the same way. Disconnecting those hosts fixes the issue after a few minutes.
Additionally, a few tasks for starting all VMs and containers and a few tasks for starting single VMs were left running and can't be stopped.

Any ideas on how to bring the last hosts up are welcome.
Part of the syslog from when the failure occurs:
Oct 11 16:40:50 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 212 1a3 255 257 21b
Oct 11 16:40:50 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 21b
Oct 11 16:40:50 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: received sync request (epoch 1/1878/00000028)
Oct 11 16:40:50 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: received sync request (epoch 1/1878/00000029)
Oct 11 16:40:50 esofiman41evsfs corosync[69291]: [MAIN ] Q empty, queued:0 sent:283.
Oct 11 16:40:52 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 11 link: 0 is up
Oct 11 16:40:52 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 11 16:40:55 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 10 link: 0 is up
Oct 11 16:40:55 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 16:40:57 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 320 ms
Oct 11 16:40:58 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 223 224 1a3
Oct 11 16:41:05 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 326 ms
Oct 11 16:41:06 esofiman41evsfs pvedaemon[80958]: worker exit
Oct 11 16:41:06 esofiman41evsfs pvedaemon[1956]: worker 80958 finished
Oct 11 16:41:06 esofiman41evsfs pvedaemon[1956]: starting 1 worker(s)
Oct 11 16:41:06 esofiman41evsfs pvedaemon[1956]: worker 13219 started
Oct 11 16:41:08 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 227 228 229 22a 1a3 26a 269
Oct 11 16:41:10 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 229 22a 1a3 1a5 1c1 1cb
Oct 11 16:41:14 esofiman41evsfs corosync[69291]: [KNET ] link: host: 10 link: 0 is down
Oct 11 16:41:14 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 16:41:14 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 has no active links
Oct 11 16:41:14 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 1a3 1a5
Oct 11 16:41:16 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 1a3 1a5 1c1 27e 27d
Oct 11 16:41:19 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 10 link: 0 is up
Oct 11 16:41:19 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 16:41:21 esofiman41evsfs sshd[13303]: Accepted password for root from 10.134.200.17 port 4029 ssh2
Oct 11 16:41:21 esofiman41evsfs sshd[13303]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 11 16:41:21 esofiman41evsfs sshd[13304]: Accepted password for root from 10.134.200.17 port 4030 ssh2
Oct 11 16:41:21 esofiman41evsfs sshd[13304]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 11 16:41:21 esofiman41evsfs systemd-logind[1514]: New session 15 of user root.
Oct 11 16:41:21 esofiman41evsfs systemd[1]: Started Session 15 of user root.
Oct 11 16:41:21 esofiman41evsfs systemd-logind[1514]: New session 16 of user root.
Oct 11 16:41:21 esofiman41evsfs systemd[1]: Started Session 16 of user root.
Oct 11 16:41:22 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 287
Oct 11 16:41:24 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 289
Oct 11 16:41:27 esofiman41evsfs pvedaemon[77877]: <root@pam> successful auth for user 'root@pve'
Oct 11 16:41:28 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 1a5 1c1
Oct 11 16:41:34 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 1de 294
Oct 11 16:41:37 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 298 294
Oct 11 16:41:43 esofiman41evsfs pveproxy[12126]: proxy detected vanished client connection
Oct 11 16:41:45 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2a0 1a3 1a5 1c1 2a2
Oct 11 16:41:45 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2a5 2a2
Oct 11 16:41:45 esofiman41evsfs pvedaemon[12130]: <root@pam> successful auth for user 'root@pve'
Oct 11 16:41:47 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 280
Oct 11 16:41:47 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2b0 2af
Oct 11 16:41:49 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 268
Oct 11 16:41:51 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2bc
Oct 11 16:41:51 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 21a 223 226 260 261 267
Oct 11 16:41:53 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2c5
Oct 11 16:41:55 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2d3 2d4 2d5
Oct 11 16:41:56 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2d3 2d4
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: received all states
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: leader is 2/1735
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: synced members: 2/1735, 7/1841, 11/1273, 12/1722
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: waiting for updates from leader
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: dfsm_deliver_queue: queue length 12
Oct 11 16:41:56 esofiman41evsfs corosync[69291]: [MAIN ] Q empty, queued:0 sent:159.
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: received all states
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: all data is up to date
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: dfsm_deliver_queue: queue length 14017
Oct 11 16:41:56 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2d4
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/esofiauto36eteams/local: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/esofiauto36eteams/datastore-esofiauto36: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/esofiauto36eteams/datastore-esofiauto36: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/esofiauto36eteams/local: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/micro26: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-node/micro26: /var/lib/rrdcached/db/pve2-node/micro26: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/90002: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/90002: /var/lib/rrdcached/db/pve2-vm/90002: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/60002: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/60002: /var/lib/rrdcached/db/pve2-vm/60002: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/80001: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/80001: /var/lib/rrdcached/db/pve2-vm/80001: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/80012: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/80012: /var/lib/rrdcached/db/pve2-vm/80012: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/80009: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/80009: /var/lib/rrdcached/db/pve2-vm/80009: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/80004: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/80004: /var/lib/rrdcached/db/pve2-vm/80004: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/80008: -1
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/80008: /var/lib/rrdcached/db/pve2-vm/80008: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
proxmox-ve: 6.2-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
There seems to be an issue with the time:
Oct 11 16:41:56 esofiman41evsfs pmxcfs[1878]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-node/micro26: /var/lib/rrdcached/db/pve2-node/micro26: illegal attempt to update using time 1633958683 when last update time is 1633958713 (minimum one second step)
Make sure the clocks on all nodes are synchronized; otherwise it can lead to behavior like this.
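For reference, the size of the backward step can be read straight from the RRD message quoted above. The following is a minimal Python sketch, not part of any Proxmox tooling: the helper name `rrd_time_gap` is hypothetical, and the regular expression assumes the message wording stays exactly as it appears in this log.

```python
import re

# Message text copied from the syslog line quoted above (path shortened to one copy).
LINE = (
    "RRD update error /var/lib/rrdcached/db/pve2-node/micro26: "
    "illegal attempt to update using time 1633958683 "
    "when last update time is 1633958713 (minimum one second step)"
)

# Assumption: the wording of the message is stable enough to match literally.
PATTERN = re.compile(r"update using time (\d+) when last update time is (\d+)")


def rrd_time_gap(line):
    """Return (attempted, last, gap_seconds), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    attempted = int(match.group(1))
    last = int(match.group(2))
    # A positive gap means the rejected update is older than the stored one.
    return attempted, last, last - attempted


attempted, last, gap = rrd_time_gap(LINE)
print(gap)  # 30
```

Here the rejected update is 30 seconds older than the last stored one. That may point to a clock that is behind on one node, but it can also happen when status updates are delivered late, so it is worth checking both.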
 
Hello and thanks for the reply.
I made sure the time was correct, but when I started one of the nodes everything crashed once again.
Below is the syslog of one of the working nodes for the few minutes before the crash.
Virtual Environment 6.2-4
Node 'esofiman41evsfs'
Oct 11 18:17:00 esofiman41evsfs systemd[1]: Starting Proxmox VE replication runner...
Oct 11 18:17:01 esofiman41evsfs CRON[50725]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 11 18:17:01 esofiman41evsfs CRON[50726]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 11 18:17:01 esofiman41evsfs CRON[50725]: pam_unix(cron:session): session closed for user root
Oct 11 18:17:04 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 59 43 45 54 55 56 58 5c
Oct 11 18:17:10 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 43 54 56 58 59 5d 5c
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 62 43 54 56 58 5c
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] link: host: 11 link: 0 is down
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] link: host: 10 link: 0 is down
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 has no active links
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 has no active links
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] link: host: 6 link: 0 is down
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] host: host: 6 has no active links
Oct 11 18:17:18 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 56 58 62 66 68 69
Oct 11 18:17:18 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: received all states
Oct 11 18:17:18 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: leader is 1/1878
Oct 11 18:17:18 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: synced members: 1/1878, 2/1735, 3/1202, 7/1841, 9/1865, 10/1261, 11/1273, 12/1722, 13/1174
Oct 11 18:17:18 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: start sending inode updates
Oct 11 18:17:18 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: sent all (14) updates
Oct 11 18:17:18 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: all data is up to date
Oct 11 18:17:18 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: dfsm_deliver_queue: queue length 9
Oct 11 18:17:21 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 10 link: 0 is up
Oct 11 18:17:21 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 11 link: 0 is up
Oct 11 18:17:21 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 18:17:21 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 11 18:17:21 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 6 link: 0 is up
Oct 11 18:17:21 esofiman41evsfs corosync[69291]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 11 18:17:24 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 56 58 66 68
Oct 11 18:17:26 esofiman41evsfs pmxcfs[1878]: [status] notice: received all states
Oct 11 18:17:26 esofiman41evsfs pmxcfs[1878]: [status] notice: all data is up to date
Oct 11 18:17:26 esofiman41evsfs pmxcfs[1878]: [status] notice: dfsm_deliver_queue: queue length 189
Oct 11 18:17:26 esofiman41evsfs pmxcfs[1878]: [status] notice: received log
Oct 11 18:17:26 esofiman41evsfs pmxcfs[1878]: [main] notice: ignore duplicate
Oct 11 18:17:30 esofiman41evsfs pmxcfs[1878]: [status] notice: received log
Oct 11 18:17:30 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 56 58 66 68 73
Oct 11 18:17:36 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 7a 7b 68 73 77 78 79 66
Oct 11 18:17:42 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 77 78 7a 7b 7e 66
Oct 11 18:17:48 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 77 7b 7e
Oct 11 18:17:48 esofiman41evsfs pmxcfs[1878]: [status] notice: received log
Oct 11 18:17:52 esofiman41evsfs pmxcfs[1878]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Oct 11 18:17:52 esofiman41evsfs pvesr[50703]: error during cfs-locked 'file-replication_cfg' operation: got lock request timeout
Oct 11 18:17:52 esofiman41evsfs systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Oct 11 18:17:52 esofiman41evsfs systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 11 18:17:52 esofiman41evsfs systemd[1]: Failed to start Proxmox VE replication runner.
Oct 11 18:17:52 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 7b 90
Oct 11 18:17:54 esofiman41evsfs corosync[69291]: [KNET ] link: host: 11 link: 0 is down
Oct 11 18:17:54 esofiman41evsfs corosync[69291]: [KNET ] link: host: 10 link: 0 is down
Oct 11 18:17:54 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 11 18:17:54 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 has no active links
Oct 11 18:17:54 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 18:17:54 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 has no active links
Oct 11 18:17:58 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 90 94 95
Oct 11 18:17:58 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 11 link: 0 is up
Oct 11 18:17:58 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 10 link: 0 is up
Oct 11 18:17:58 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 11 18:17:58 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 18:18:00 esofiman41evsfs systemd[1]: Starting Proxmox VE replication runner...
Oct 11 18:18:04 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 292 ms
Oct 11 18:18:05 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 99 9c 9b 9d 9e 9f a0
Oct 11 18:18:30 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 306 ms
Oct 11 18:18:37 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 15 16 17 18 19 1a 1b 1c 1d 1e 2c 2 4 5 6 8 9 a b 1f 23 24 25 26 27 28 29 2a 2b 2d
Oct 11 18:18:41 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 30 31 32 24 2c 23 9 a b e 11 12 13 14 15 16 17 18 19 1a 25 26 27 28 29 2a 2b 2d 2f
Oct 11 18:18:48 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 305 ms
Oct 11 18:18:49 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2f 38 39 3a 47 48 49 4a 4b 30 31 32 f 10 11 12 13 14 15 16 26 27 28 29 2a 2b 2c 2d 4c 4d
Oct 11 18:18:55 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 26 27 28 29 2a 2b 2c 2d 4c 4d 4e 38 39 65 66 67 f 10 12 13 30 31 32 3a 47 48 49 4a 4b
Oct 11 18:18:59 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 39 65 66 67 30 31 32 47 48 49 4a 4b 73 74 75 77 78 79 27 28 72 29 2a 2b 2c 2d 38 4c 4d 4e
Oct 11 18:19:03 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 73 74 75 77 78 79 27 28 2a 2b 2c 38 4c 4d 4e 7a 76 39 65 92 30 31 32 47 48 49 4a 4b 61 67
Oct 11 18:19:07 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 4d 4e 7a 39 65 92 30 31 32 47 49 4a 4b a0 a1 74 75 9f a2 a3 15 27 28 2a 2b 2c 38 4c 77 78
Oct 11 18:19:11 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: a0 a1 74 75 a2 27 28 2b 2c 38 4c 78 79 a4 a5 a6 a7 4d 4e 7a 16 30 31 32 39 47 49 4a 4b 65
Oct 11 18:19:17 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: a4 a5 a6 a7 4d 4e 7a 30 31 47 49 4a 4b 65 92 74 a1 c1 17 18 27 28 2b 2c 38 4c 75 78 79
Oct 11 18:19:48 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 15 16 17 18 19 1a 1b 1c 1d 1e 22 2 4 5 7 8 9 a c d 23 24 25 26 27 28 29 2a 2b 2c
Oct 11 18:19:54 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 305 ms
Oct 11 18:19:56 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 30 31 32 23 24 25 9 a c d 11 12 13 14 15 5 16 17 18 19 27 28 29 2a 2b 2c 2d 2f
Oct 11 18:20:02 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 303 ms
Oct 11 18:20:04 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2f 38 39 3a 47 48 49 4a 4b 4c 30 31 32 e 11 5 12 13 14 15 24 25 27 28 29 2a 2b 2c 2d 4d
Oct 11 18:20:10 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 6451 ms
Oct 11 18:20:36 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 15 16 17 18 19 1a 1b 1c 1d 1e 21 22 2 4 5 6 7 8 9 a 23 24 25 26 27 28 29 2a 2b 2c
Oct 11 18:20:36 esofiman41evsfs pvedaemon[46943]: <root@pam> successful auth for user 'root@pve'
Oct 11 18:20:42 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 30 31 32 24 25 23 7 8 9 a c 12 13 14 15 16 17 18 19 1a 27 28 29 2a 2b 2c 2d 2f
Oct 11 18:20:42 esofiman41evsfs pveproxy[37756]: worker exit
Oct 11 18:20:42 esofiman41evsfs pveproxy[1967]: worker 37756 finished
Oct 11 18:20:42 esofiman41evsfs pveproxy[1967]: starting 1 worker(s)
Oct 11 18:20:42 esofiman41evsfs pveproxy[1967]: worker 51995 started
Oct 11 18:20:44 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 2f 38 39 3a 47 48 49 4a 4b 4c 30 31 32 10 11 12 13 14 15 16 25 27 28 29 2a 2b 2c 2d 4d 4e
Oct 11 18:20:50 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 308 ms
Oct 11 18:20:52 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 25 27 28 29 2a 2b 2c 2d 4d 4e 38 39 65 66 67 10 11 12 13 14 31 32 3a 47 48 49 4a 4b 4c
Oct 11 18:20:56 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 31 32 47 48 49 4a 4b 4c 73 74 75 77 78 79 76 25 27 28 7a 92 29 2b 2c 2d 38 39 4d 4e 67
Oct 11 18:21:02 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 78 79 27 28 7a 92 2b 2c 2d 38 39 4d 4e 67 31 32 47 9f a0 a1 a2 49 4a 4b 4c 73 74 75 77 a3
Oct 11 18:21:08 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 67 32 47 a0 a1 a2 49 4a 4b 4c 74 75 77 a4 a5 a6 a7 27 78 79 14 28 2b 2c 2d 38 39 4d 4e 7a
Oct 11 18:21:12 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: a4 a5 a6 a7 27 79 2b 2c 38 39 4d 4e 7a 92 32 47 67 c1 c2 15 16 49 4a 4b 4c 74 75 77 a1 a2
Oct 11 18:21:44 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 308 ms
Oct 11 18:22:09 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 15 16 17 18 19 1a 1b 1c 1d 1e 22 2 3 4 5 8 9 b c d 23 24 25 26 27 28 29 2a 2b 2c
Oct 11 18:22:13 esofiman41evsfs corosync[69291]: [KNET ] link: host: 13 link: 0 is down
Oct 11 18:22:13 esofiman41evsfs corosync[69291]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
Oct 11 18:22:13 esofiman41evsfs corosync[69291]: [KNET ] host: host: 13 has no active links
Oct 11 18:22:15 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 2243 ms
Oct 11 18:22:33 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 20186 ms
Oct 11 18:22:33 esofiman41evsfs pvedaemon[40517]: <root@pve> starting task UPID:esofiman41evsfs:0000CDA1:00212A12:616456B9:vncproxy:44001:root@pve:
Oct 11 18:22:33 esofiman41evsfs pvedaemon[52641]: starting vnc proxy UPID:esofiman41evsfs:0000CDA1:00212A12:616456B9:vncproxy:44001:root@pve:
Oct 11 18:22:47 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 13 link: 0 is up
Oct 11 18:22:47 esofiman41evsfs corosync[69291]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
Oct 11 18:23:05 esofiman41evsfs pveproxy[51995]: Clearing outdated entries from certificate cache
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] link: host: 11 link: 0 is down
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] link: host: 10 link: 0 is down
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] link: host: 6 link: 0 is down
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 has no active links
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 has no active links
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 11 18:23:11 esofiman41evsfs corosync[69291]: [KNET ] host: host: 6 has no active links
Oct 11 18:23:15 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 6 a b 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 2 4 5 7 8 9 c d 1f 23 24 25 26 27
Oct 11 18:23:15 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 11 link: 0 is up
Oct 11 18:23:15 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 10 link: 0 is up
Oct 11 18:23:15 esofiman41evsfs corosync[69291]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Oct 11 18:23:15 esofiman41evsfs corosync[69291]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Oct 11 18:23:15 esofiman41evsfs corosync[69291]: [KNET ] rx: host: 6 link: 0 is up
Oct 11 18:23:15 esofiman41evsfs corosync[69291]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 11 18:23:21 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 300 ms
Oct 11 18:23:23 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 31 32 b 11 12 24 25 26 23 4 5 7 8 9 c d 13 14 16 17 2 18 19 1a 1b 27 28 29 2a 2b
Oct 11 18:23:27 esofiman41evsfs corosync[69291]: [TOTEM ] Retransmit List: 38 39 3a 8 24 25 31 32 4 5 9 a b c d 10 11 12 13 14 16 17 18 19 1a 28 29 2a 2b 2c
 
And this is the log from the failed node.
Oct 11 18:16:23 esofiauto28emaster kernel: [ 12.009678] vmbr0: port 1(eno1np0) entered forwarding state
Oct 11 18:16:23 esofiauto28emaster kernel: [ 12.244433] bpfilter: Loaded bpfilter_umh pid 1633
Oct 11 18:16:24 esofiauto28emaster kernel: [ 12.976147] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
Oct 11 18:16:24 esofiauto28emaster kernel: [ 13.367890] NFSD: Using UMH upcall client tracking operations.
Oct 11 18:16:24 esofiauto28emaster kernel: [ 13.367894] NFSD: starting 90-second grace period (net f00000d8)
Oct 11 18:16:25 esofiauto28emaster kernel: [ 14.605682] sctp: Hash tables configured (bind 4096/4096)
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.023477] FS-Cache: Loaded
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.042970] FS-Cache: Netfs 'nfs' registered for caching
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.134026] NFS: Registering the id_resolver key type
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.134033] Key type id_resolver registered
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.134034] Key type id_legacy registered
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.342919] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.723448] device tap44001i0 entered promiscuous mode
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.786209] fwbr44001i0: port 1(fwln44001i0) entered blocking state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.786213] fwbr44001i0: port 1(fwln44001i0) entered disabled state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.786361] device fwln44001i0 entered promiscuous mode
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.786477] fwbr44001i0: port 1(fwln44001i0) entered blocking state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.786480] fwbr44001i0: port 1(fwln44001i0) entered forwarding state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.794326] vmbr0: port 2(fwpr44001p0) entered blocking state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.794330] vmbr0: port 2(fwpr44001p0) entered disabled state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.794480] device fwpr44001p0 entered promiscuous mode
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.794599] vmbr0: port 2(fwpr44001p0) entered blocking state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.794602] vmbr0: port 2(fwpr44001p0) entered forwarding state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.802059] fwbr44001i0: port 2(tap44001i0) entered blocking state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.802062] fwbr44001i0: port 2(tap44001i0) entered disabled state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.802355] fwbr44001i0: port 2(tap44001i0) entered blocking state
Oct 11 18:16:32 esofiauto28emaster kernel: [ 21.802360] fwbr44001i0: port 2(tap44001i0) entered forwarding state
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404402] pve-bridge D 0 2093 2088 0x00000000
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404407] Call Trace:
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404421] __schedule+0x2e6/0x700
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404431] ? filename_parentat.isra.57.part.58+0xf7/0x180
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404435] schedule+0x33/0xa0
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404441] rwsem_down_write_slowpath+0x2ed/0x4a0
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404447] down_write+0x3d/0x40
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404452] filename_create+0x8e/0x180
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404458] do_mkdirat+0x59/0x110
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404463] __x64_sys_mkdir+0x1b/0x20
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404472] do_syscall_64+0x57/0x190
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404478] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404482] RIP: 0033:0x7f903e2030d7
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404490] Code: Bad RIP value.
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404492] RSP: 002b:00007fffc86963c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404496] RAX: ffffffffffffffda RBX: 000055f300d52260 RCX: 00007f903e2030d7
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404498] RDX: 000000000000002c RSI: 00000000000001ff RDI: 000055f3045e7340
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404500] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404502] R10: 0000000000000000 R11: 0000000000000246 R12: 000055f3014282d8
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404503] R13: 000055f3045e7340 R14: 000055f30450e778 R15: 00000000000001ff
Oct 11 18:20:15 esofiauto28emaster kernel: [ 244.404650] pvesr D 0 2208 1 0x00000000
Is there any way to manually synchronize the db of the nodes?
 
Looks like a network issue then. Check your NICs, cables and switches.
Also, update to at least the latest PVE 6.4 release.

What NIC models are you using? Is the firmware up-to-date?
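To narrow down which links are affected, it can help to tally the corosync messages from the syslog. The following is a rough Python sketch, assuming the log format shown in this thread; the function name `summarize` is hypothetical, and the sample lines are a hand-picked subset of the excerpt posted above rather than a complete log.

```python
import re
from collections import Counter

# A few lines copied from the syslog excerpt earlier in this thread.
SAMPLE = """\
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] link: host: 11 link: 0 is down
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] link: host: 10 link: 0 is down
Oct 11 18:17:16 esofiman41evsfs corosync[69291]: [KNET ] link: host: 6 link: 0 is down
Oct 11 18:17:54 esofiman41evsfs corosync[69291]: [KNET ] link: host: 11 link: 0 is down
Oct 11 18:18:04 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 292 ms
Oct 11 18:20:10 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 6451 ms
Oct 11 18:22:33 esofiman41evsfs corosync[69291]: [TOTEM ] Token has not been received in 20186 ms
"""

# Assumption: message wording matches the corosync output seen in this thread.
LINK_DOWN = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")
TOKEN_LATE = re.compile(r"\[TOTEM\s*\] Token has not been received in (\d+) ms")


def summarize(lines):
    """Count link-down events per host ID and track the longest token delay in ms."""
    down = Counter()
    worst_ms = 0
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            down[int(match.group(1))] += 1
            continue
        match = TOKEN_LATE.search(line)
        if match:
            worst_ms = max(worst_ms, int(match.group(1)))
    return down, worst_ms


down, worst_ms = summarize(SAMPLE.splitlines())
print(sorted(down.items()))
print(worst_ms)
```

If the same few host IDs keep dropping while the others stay up, that suggests looking at the path to those hosts first (NIC, cable, switch port) rather than at the cluster as a whole.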
 
We have already checked the network part.
We are planning an update, but it will take place once the cluster is back to its normal state.
Is there any way to manually sync the db between two nodes?
Also, is there any way to manually kill these tasks? They are stuck forever and can't be stopped from the UI.
tasks.png
 
Do you mean pmxcfs? There's no need for that.
Could you provide the interfaces file (/etc/network/interfaces) for all nodes, as well as the corosync config (/etc/corosync/corosync.conf and /etc/pve/corosync.conf)?
 
Hello,
And thanks for the support. It seems that the problem was caused by intermittent high network latency, which left corosync unable to sync correctly.
To resolve the problem, we deployed a private network for the cluster. Then we configured the interfaces and changed /etc/corosync/corosync.conf on every node to use the private interface. Finally, we started corosync.service on each node, and the cluster was up and running once again.
Is there any way to create "backup" links for corosync?
Thank you.
 
