[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues


Marin Bernard

Member
Jul 19, 2019
Hi,

We're in the process of migrating from Proxmox VE 5 to version 6.0. Following the recommendations on the wiki, I updated all nodes to the latest PVE 5 release (5.4-11) and then upgraded our 11-node cluster to Corosync 3. Everything seemed to work, except that we're experiencing random disconnects of our nodes every 12-24 hours. Restarting the affected node fixed the issue until this morning, when the whole cluster showed up as disconnected. I tried to restart corosync + pve-cluster on all nodes, but that resulted in a fully disconnected cluster with pmxcfs unavailable on every node.

I ran pve5to6 before the upgrade; it did not output any warning except that my corosync config had been customized (one of our nodes carries 2 quorum votes instead of one).

All cluster nodes communicate on the same Ethernet segment, using a VLAN dedicated to cluster administration. There is no firewall or anything similar in between.

Is anyone else experiencing the same issues? Is there a way to fall back to Corosync 2.x until this has been investigated?

Many many thanks!

Marin.
 

Could you provide your corosync.conf and the logs of both the pve-cluster and corosync services around the outages you experienced?
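
As a sketch of how that information could be collected on each affected node (the time window below is only an example; journalctl keeps full lines, unlike a pager):

Code:
# local copy of the cluster config (readable even when pmxcfs is down)
cat /etc/corosync/corosync.conf

# corosync and pve-cluster logs around the outage window
journalctl -u corosync -u pve-cluster --since "2019-07-19 09:00" --until "2019-07-19 11:00" > /tmp/cluster-outage.log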
 
Hi,

Here is corosync.conf. All node names resolve to valid IP addresses (listed in /etc/hosts):

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: ass-pve-1203-01
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.6.100.199
  }
  node {
    name: ass-pve-1308-01
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.6.100.196
  }
  node {
    name: ass-pve-1308-02
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.6.100.197
  }
  node {
    name: ass-pve-1404-01
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.6.100.200
  }
  node {
    name: ass-pve-1710-01
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 10.6.100.194
  }
  node {
    name: ass-pve-1710-02
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.6.100.198
  }
  node {
    name: ass-pve-1710-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.6.100.195
  }
  node {
    name: ass-pve-1805-01
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.6.100.193
  }
  node {
    name: cds-pve-1603-01
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.6.100.204
  }
  node {
    name: ros-pve-1403-01
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.6.100.202
  }
  node {
    name: ros-pve-1606-01
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.6.100.203
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pep06-ros-mgmt
  config_version: 11
  interface {
    bindnetaddr: 10.6.100.194
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
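
As an aside, the interface/bindnetaddr block inside totem is the legacy Corosync 2 syntax; Corosync 3 ignores it when a nodelist is present (the startup log below warns about exactly this). A trimmed totem section, shown here only as a sketch based on the config above, would look like:

Code:
totem {
  cluster_name: pep06-ros-mgmt
  config_version: 12   # must be bumped whenever the file is edited
  ip_version: ipv4
  secauth: on
  version: 2
}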

I restarted most of the nodes hoping to get the cluster back to a responsive state, without success. The corosync daemon seems to give up forming a quorum a few seconds after it starts. Restarting it does not help. Here is what's shown in the log:

Code:
Jul 19 10:28:57 ass-pve-1308-01 systemd[1]: Starting Corosync Cluster Engine...
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [MAIN  ] Corosync Cluster Engine 3.0.2-dirty starting up
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [MAIN  ] Please migrate config file to nodelist.
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Initializing transport (Kronosnet).
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [TOTEM ] totemknet initialized
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [QB    ] server name: cmap
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [QB    ] server name: cfg
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [QB    ] server name: cpg
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [WD    ] Watchdog not enabled by configuration
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [WD    ] resource load_15min missing a recovery key.
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [WD    ] resource memory_used missing a recovery key.
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [WD    ] no resources configured.
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [QUORUM] Using quorum provider corosync_votequorum
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [QB    ] server name: votequorum
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [QB    ] server name: quorum
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 0)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 9 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 9 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 9 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 10 has no active links
Jul 19 10:28:57 ass-pve-1308-01 systemd[1]: Started Corosync Cluster Engine.
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 10 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [TOTEM ] A new membership (4:151520) was formed. Members joined: 4
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 10 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 0)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 5 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 5 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 5 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [CPG   ] downlist left_list: 0 received
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 7 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [QUORUM] Members[1]: 4
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 7 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 7 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 1 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 1 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 1 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 0)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 6 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 6 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 6 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 3 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 3 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 3 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 2 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 2 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 2 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 0)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 11 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 11 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 11 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 0)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 8 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 8 has no active links
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Jul 19 10:28:57 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 8 has no active links
Jul 19 10:29:00 ass-pve-1308-01 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 11 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 8 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 6 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 10 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 5 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 9 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] rx: host: 1 link: 0 is up
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 8 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 11 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 6 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 10 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 9 link: 0 from 470 to 1366
Jul 19 10:29:01 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: Global data MTU changed to: 1366
Jul 19 10:29:01 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:02 ass-pve-1308-01 pmxcfs[6631]: [status] notice: update cluster info (cluster name  pep06-ros-mgmt, version = 11)
Jul 19 10:29:02 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:03 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:04 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:05 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:06 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:07 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:08 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:09 ass-pve-1308-01 pvesr[13337]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:29:10 ass-pve-1308-01 pvesr[13337]: error with cfs lock 'file-replication_cfg': no quorum!
Jul 19 10:29:10 ass-pve-1308-01 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jul 19 10:29:10 ass-pve-1308-01 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 19 10:29:10 ass-pve-1308-01 systemd[1]: pvesr.service: Unit entered failed state.
Jul 19 10:29:10 ass-pve-1308-01 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 19 10:29:18 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Token has not been received in 247 ms
Jul 19 10:29:29 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Token has not been received in 247 ms
Jul 19 10:29:39 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Token has not been received in 247 ms
Jul 19 10:29:47 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Token has not been received in 266 ms
Jul 19 10:29:59 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Token has not been received in 247 ms
Jul 19 10:30:00 ass-pve-1308-01 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 10:30:00 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:01 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:02 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:03 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:04 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:05 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:06 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:07 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:08 ass-pve-1308-01 pvesr[14126]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 19 10:30:09 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Token has not been received in 247 ms
Jul 19 10:30:09 ass-pve-1308-01 pvesr[14126]: error with cfs lock 'file-replication_cfg': no quorum!
Jul 19 10:30:09 ass-pve-1308-01 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jul 19 10:30:09 ass-pve-1308-01 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 19 10:30:09 ass-pve-1308-01 systemd[1]: pvesr.service: Unit entered failed state.
Jul 19 10:30:09 ass-pve-1308-01 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 19 10:30:21 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Token has not been received in 247 ms
Jul 19 10:30:31 ass-pve-1308-01 corosync[13248]:   [TOTEM ] Token has not been received in 247 ms
Jul 19 10:30:38 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 1 link 0 but the other node is n
Jul 19 10:30:38 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfig
Jul 19 10:30:38 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1366 to 1350
Jul 19 10:30:38 ass-pve-1308-01 corosync[13248]:   [KNET  ] pmtud: Global data MTU changed to: 1350

After restarting corosync on multiple nodes, I managed to achieve quorum long enough to at least restart the VMs, but I doubt it will survive for long.

Is there a way to go back to Corosync 2?

Many thanks,

Marin
 
Your logs are cut off; please include output not copied from a pager, and with more lines after startup (e.g., until corosync.service gives up on starting). It also logs a potential issue about MTU misconfiguration, so could you include your network config as well? "corosync-cmapctl -m stats" might also provide some insight (it prints the statistics collected by kronosnet/corosync about network traffic and links).
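
For example, a minimal sketch of the requested diagnostics, run on each node while the problem is present:

Code:
corosync-cmapctl -m stats
# or, to focus on the per-link counters only:
corosync-cmapctl -m stats | grep -i link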
 
Hi again,

I'm sorry; since this is a production environment, I had no choice but to destroy the cluster and revert to an array of independent nodes to restore the service.

Corosync does not give up on starting. It starts, tries to form a quorum, sometimes succeeds with a subset of the nodes, but often fails.

My network config is completely standard (no custom MTU anywhere, Intel NICs, Cisco switches with a very simple configuration) and is similar on all nodes. Here is the one from a failing node:

Code:
auto lo
iface lo inet loopback

iface enp3s0f0 inet manual
#MGMT-00

iface enp3s0f1 inet manual
#CORE-00

auto vmbr1424
iface vmbr1424 inet static
        address  10.6.100.202
        netmask  255.255.255.0
        gateway  10.6.100.254
        post-up  echo 1 > /proc/sys/net/ipv6/conf/vmbr1424/disable_ipv6
        bridge_ports enp3s0f0
        bridge_stp off
        bridge_fd 0
#PEP06_ROS_pMGMT

auto vmbr1460
iface vmbr1460 inet manual
        post-up  echo 1 > /proc/sys/net/ipv6/conf/vmbr1460/disable_ipv6
        bridge_ports enp3s0f1
        bridge_stp off
        bridge_fd 0
#PEP06_ROS_IEM_lCORE

Standard MTU is supported between all nodes:

Code:
root@ros-pve-1403-01:~# ping -M do -s 1472 ros-pve-1606-01
PING ros-pve-1606-01.core.ros.intranet.pep06.fr (10.6.100.203) 1472(1500) bytes of data.
1480 bytes from ros-pve-1606-01.core.ros.intranet.pep06.fr (10.6.100.203): icmp_seq=1 ttl=64 time=0.254 ms
1480 bytes from ros-pve-1606-01.core.ros.intranet.pep06.fr (10.6.100.203): icmp_seq=2 ttl=64 time=0.252 ms
1480 bytes from ros-pve-1606-01.core.ros.intranet.pep06.fr (10.6.100.203): icmp_seq=3 ttl=64 time=0.250 ms
1480 bytes from ros-pve-1606-01.core.ros.intranet.pep06.fr (10.6.100.203): icmp_seq=4 ttl=64 time=0.229 ms

Now that I have destroyed the cluster, I'm afraid it will be impossible to find out what happened. From my point of view, the problem is linked to bad network connectivity: SSH sessions to the nodes were also hanging randomly for a few seconds while the cluster was degraded, even though network latency stayed below 1 ms. Since the cluster was destroyed, there has been no more hanging. Maybe the switch to unicast is too demanding?
 
I don't think that unicast per se is too demanding, but an 11-node cluster struggling to establish quorum may well have put load on your network; if it's only a VLAN and not a physically separate network, that could also affect your SSH sessions. From the limited logs you provided, it looks as if your network/VLAN setup interfered with knet's MTU autodetection.
 
I installed Corosync 3 yesterday on my 7-node cluster and had a really bad day: kernel/host crashes (they looked like kernel panics, and the watchdog restarted the nodes) and random disconnects, up to a complete loss of PVE cluster quorum, with no idea why.
This is just a Gigabit home network, and I also saw hanging SSH connections while the rest of the network and its connections worked just fine. The PVE-HA-Manager was also hanging completely, and the PVE web frontend did not really respond.

So I have no idea how this should continue: will it be stable after the PVE upgrade, or will it stay this way?

I was able to downgrade to Corosync 2 using:

apt-get install corosync=2.4.4-pve1

and by removing the new corosync 3 repository again.
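
To keep apt from pulling Corosync 3 back in on the next dist-upgrade, the downgraded package can also be held (a sketch; release the hold before retrying the PVE 6 upgrade):

Code:
apt-get install corosync=2.4.4-pve1
apt-mark hold corosync
# later, before attempting the corosync 3 / PVE 6 upgrade again:
apt-mark unhold corosync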

My corosync.conf:


Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pm1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.178.76
  }
  node {
    name: pm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.178.50
  }
  node {
    name: pm3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.178.81
  }
  node {
    name: pm4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.178.82
  }
  node {
    name: pm5
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.178.95
  }
  node {
    name: pm6
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.178.94
  }
  node {
    name: pm7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.178.96
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: apollon-pm
  config_version: 7
  interface {
    bindnetaddr: 192.168.178.76
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Log-wise, I often found things like:


Code:
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] link: host: 3 link: 0 is down
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] link: host: 2 link: 0 is down
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] link: host: 1 link: 0 is down
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] host: host: 3 has no active links
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] host: host: 2 has no active links
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 31 07:17:17 pm6 corosync[9920]:   [KNET  ] host: host: 1 has no active links
Jul 31 07:17:17 pm6 corosync[9920]:   [TOTEM ] Retransmit List: bdb5
Jul 31 07:17:18 pm6 corosync[9920]:   [TOTEM ] Retransmit List: bdc1
Jul 31 07:17:18 pm6 corosync[9920]:   [TOTEM ] Retransmit List: bdc2
Jul 31 07:17:19 pm6 corosync[9920]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 31 07:17:19 pm6 corosync[9920]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 31 07:17:19 pm6 corosync[9920]:   [KNET  ] rx: host: 1 link: 0 is up
Jul 31 07:17:19 pm6 corosync[9920]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 07:17:19 pm6 corosync[9920]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 07:17:19 pm6 corosync[9920]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)

And then, when everything failed (without any real problems on the hosts themselves):


Code:
Jul 31 20:18:27 pm6 corosync[1617]:   [TOTEM ] Retransmit List: b65c
Jul 31 20:18:27 pm6 corosync[1617]:   [TOTEM ] Retransmit List: b65d
Jul 31 20:18:32 pm6 corosync[1617]:   [TOTEM ] A new membership (4:104) was formed. Members left: 1 2 3
Jul 31 20:18:32 pm6 corosync[1617]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3
Jul 31 20:18:32 pm6 corosync[1617]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm6 corosync[1617]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm6 corosync[1617]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm6 corosync[1617]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm6 pmxcfs[1431]: [dcdb] notice: members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm6 pmxcfs[1431]: [dcdb] notice: starting data syncronisation
Jul 31 20:18:32 pm6 pmxcfs[1431]: [status] notice: members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm6 pmxcfs[1431]: [status] notice: starting data syncronisation
Jul 31 20:18:32 pm6 corosync[1617]:   [QUORUM] Members[4]: 4 5 6 7
Jul 31 20:18:32 pm6 corosync[1617]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 31 20:18:32 pm6 pmxcfs[1431]: [dcdb] notice: received sync request (epoch 4/1232/00000006)
Jul 31 20:18:32 pm6 pmxcfs[1431]: [status] notice: received sync request (epoch 4/1232/00000006)
Jul 31 20:18:32 pm6 pmxcfs[1431]: [dcdb] notice: received all states
Jul 31 20:18:32 pm6 pmxcfs[1431]: [dcdb] notice: leader is 4/1232
Jul 31 20:18:32 pm6 pmxcfs[1431]: [dcdb] notice: synced members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm6 pmxcfs[1431]: [dcdb] notice: all data is up to date
Jul 31 20:18:32 pm6 pmxcfs[1431]: [dcdb] notice: dfsm_deliver_queue: queue length 12
Jul 31 20:18:32 pm6 pmxcfs[1431]: [status] notice: received all states
Jul 31 20:18:32 pm6 pmxcfs[1431]: [status] notice: all data is up to date
Jul 31 20:18:32 pm6 pmxcfs[1431]: [status] notice: dfsm_deliver_queue: queue length 18
Jul 31 20:18:41 pm6 corosync[1617]:   [TOTEM ] A new membership (4:108) was formed. Members
Jul 31 20:18:46 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 10
Jul 31 20:18:47 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 20
Jul 31 20:18:48 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 30
Jul 31 20:18:48 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 10
Jul 31 20:18:49 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 40
Jul 31 20:18:49 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 20
Jul 31 20:18:49 pm6 corosync[1617]:   [TOTEM ] A new membership (4:116) was formed. Members
Jul 31 20:18:49 pm6 corosync[1617]:   [TOTEM ] A new membership (1:124) was formed. Members joined: 1 2 3
Jul 31 20:18:49 pm6 corosync[1617]:   [TOTEM ] Retransmit List: 1 2
Jul 31 20:18:49 pm6 corosync[1617]:   [TOTEM ] Retransmit List: b c
Jul 31 20:18:49 pm6 corosync[1617]:   [TOTEM ] Retransmit List: d
Jul 31 20:18:50 pm6 corosync[1617]:   [TOTEM ] Retransmit List: 11
Jul 31 20:18:50 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 50
Jul 31 20:18:50 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 30
Jul 31 20:18:51 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 60
Jul 31 20:18:51 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 40
Jul 31 20:18:52 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 70
Jul 31 20:18:52 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 50
Jul 31 20:18:53 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 80
Jul 31 20:18:53 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 60
Jul 31 20:18:54 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 90
Jul 31 20:18:54 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 70
Jul 31 20:18:55 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retry 100
Jul 31 20:18:55 pm6 pmxcfs[1431]: [dcdb] notice: cpg_send_message retried 100 times
Jul 31 20:18:55 pm6 pmxcfs[1431]: [dcdb] crit: cpg_send_message failed: 6
Jul 31 20:18:55 pm6 pve-ha-lrm[1832]: lost lock 'ha_agent_pm6_lock - cfs lock update failed - Device or resource busy
Jul 31 20:18:55 pm6 pve-ha-lrm[1832]: status change active => lost_agent_lock
Jul 31 20:18:55 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 80
Jul 31 20:18:56 pm6 pmxcfs[1431]: [status] notice: cpg_send_message retry 90

The messages at the end continued like this afterwards, and everything was broken.
On another host it was stranger, especially those long retransmit lists:


Code:
Jul 31 20:18:27 pm4 corosync[1432]:   [TOTEM ] Retransmit List: b4b5 b4b9 b4c3 b4ca b4cf b4d0 b4d2 b4d4 b546 b54d b54f b550 b55b b562 b563 b564 b5a3
Jul 31 20:18:27 pm4 corosync[1432]:   [TOTEM ] Retransmit List: b4b5 b4b9 b4c3 b4ca b4cf b4d0 b4d2 b4d4 b546 b54d b54f b550 b55b b562 b563 b564 b5a3
Jul 31 20:18:27 pm4 corosync[1432]:   [TOTEM ] Retransmit List: b4b5 b4b9 b4c3 b4ca b4cf b4d0 b4d2 b4d4 b546 b54d b54f b550 b55b b562 b563 b564 b5a3
Jul 31 20:18:27 pm4 corosync[1432]:   [TOTEM ] Retransmit List: b4b5 b4b9 b4c3 b4ca b4cf b4d0 b4d2 b4d4 b546 b54d b54f b550 b55b b562 b563 b564 b5a3
Jul 31 20:18:27 pm4 corosync[1432]:   [TOTEM ] Retransmit List: b4b5 b4b9 b4c3 b4ca b4cf b4d0 b4d2 b4d4 b546 b54d b54f b550 b55b b562 b563 b564 b5a3
Jul 31 20:18:32 pm4 corosync[1432]:   [TOTEM ] A new membership (4:104) was formed. Members left: 1 2 3
Jul 31 20:18:32 pm4 corosync[1432]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3
Jul 31 20:18:32 pm4 corosync[1432]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm4 corosync[1432]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm4 corosync[1432]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: starting data syncronisation
Jul 31 20:18:32 pm4 corosync[1432]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm4 corosync[1432]:   [QUORUM] Members[4]: 4 5 6 7
Jul 31 20:18:32 pm4 corosync[1432]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: cpg_send_message retried 1 times
Jul 31 20:18:32 pm4 pmxcfs[1232]: [status] notice: members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm4 pmxcfs[1232]: [status] notice: starting data syncronisation
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: received sync request (epoch 4/1232/00000006)
Jul 31 20:18:32 pm4 pmxcfs[1232]: [status] notice: received sync request (epoch 4/1232/00000006)
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: received all states
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: leader is 4/1232
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: synced members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: start sending inode updates
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: sent all (0) updates
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: all data is up to date
Jul 31 20:18:32 pm4 pmxcfs[1232]: [dcdb] notice: dfsm_deliver_queue: queue length 12
Jul 31 20:18:32 pm4 pmxcfs[1232]: [status] notice: received all states
Jul 31 20:18:32 pm4 pmxcfs[1232]: [status] notice: all data is up to date
Jul 31 20:18:32 pm4 pmxcfs[1232]: [status] notice: dfsm_deliver_queue: queue length 18
Jul 31 20:18:33 pm4 pvesr[22319]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:34 pm4 pvesr[22319]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:35 pm4 pvesr[22319]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:36 pm4 pvesr[22319]: error with cfs lock 'file-replication_cfg': got lock request timeout
Jul 31 20:18:37 pm4 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Jul 31 20:18:37 pm4 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 31 20:18:37 pm4 systemd[1]: pvesr.service: Unit entered failed state.
Jul 31 20:18:37 pm4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 31 20:18:41 pm4 corosync[1432]:   [TOTEM ] A new membership (4:108) was formed. Members
Jul 31 20:18:46 pm4 pmxcfs[1232]: [status] notice: cpg_send_message retry 10
Jul 31 20:18:47 pm4 pmxcfs[1232]: [status] notice: cpg_send_message retry 20
Jul 31 20:18:48 pm4 pmxcfs[1232]: [status] notice: cpg_send_message retry 30
Jul 31 20:18:49 pm4 pmxcfs[1232]: [status] notice: cpg_send_message retry 40
Jul 31 20:18:49 pm4 corosync[1432]:   [TOTEM ] A new membership (4:116) was formed. Members
Jul 31 20:18:49 pm4 corosync[1432]:   [TOTEM ] A new membership (1:124) was formed. Members joined: 1 2 3
Jul 31 20:18:49 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6
Jul 31 20:18:49 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6
Jul 31 20:18:49 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6
Jul 31 20:18:50 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6
Jul 31 20:18:50 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6
Jul 31 20:18:50 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6
Jul 31 20:18:50 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6
Jul 31 20:18:50 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6 12
Jul 31 20:18:50 pm4 corosync[1432]:   [TOTEM ] Retransmit List: 4 5 6 12

And another host:
Code:
Jul 31 20:18:00 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b534
Jul 31 20:18:00 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b535
Jul 31 20:18:00 pm5 systemd[1]: Starting Proxmox VE replication runner...
Jul 31 20:18:01 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b53c
Jul 31 20:18:02 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:03 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b54b
Jul 31 20:18:04 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:04 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b556
Jul 31 20:18:05 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:05 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b557
Jul 31 20:18:05 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b558
Jul 31 20:18:05 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b559
Jul 31 20:18:05 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b55a
Jul 31 20:18:05 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b561
Jul 31 20:18:06 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:06 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b57c
Jul 31 20:18:06 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b57d
Jul 31 20:18:07 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:07 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b587
Jul 31 20:18:07 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b589
Jul 31 20:18:07 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b58a
Jul 31 20:18:07 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b58b
Jul 31 20:18:07 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b58c
Jul 31 20:18:07 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b58d
Jul 31 20:18:07 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b58e
Jul 31 20:18:08 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:08 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5a5
Jul 31 20:18:09 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:09 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5b2
Jul 31 20:18:10 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:10 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5bc
Jul 31 20:18:11 pm5 pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:18:11 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5c1
Jul 31 20:18:11 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5c2
Jul 31 20:18:12 pm5 pvesr[22667]: error with cfs lock 'file-replication_cfg': got lock request timeout
Jul 31 20:18:12 pm5 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Jul 31 20:18:12 pm5 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 31 20:18:12 pm5 systemd[1]: pvesr.service: Unit entered failed state.
Jul 31 20:18:12 pm5 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 31 20:18:12 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5c7
Jul 31 20:18:13 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5ce
Jul 31 20:18:14 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5d3
Jul 31 20:18:15 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5d4
Jul 31 20:18:15 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5d5
Jul 31 20:18:15 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5d6
Jul 31 20:18:15 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5d7
Jul 31 20:18:15 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5db
Jul 31 20:18:16 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5f1
Jul 31 20:18:16 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5f3
Jul 31 20:18:17 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b5fc
Jul 31 20:18:17 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b600
Jul 31 20:18:17 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b601
Jul 31 20:18:17 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b602
Jul 31 20:18:17 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b603
Jul 31 20:18:18 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b612
Jul 31 20:18:18 pm5 systemd[1]: Started Check_MK (192.168.178.83:42880).
Jul 31 20:18:19 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b61b
Jul 31 20:18:20 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b622
Jul 31 20:18:21 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b627
Jul 31 20:18:21 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b628
Jul 31 20:18:22 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b62d
Jul 31 20:18:23 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b634
Jul 31 20:18:25 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b63c
Jul 31 20:18:25 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b63d
Jul 31 20:18:25 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b63e
Jul 31 20:18:25 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b63f
Jul 31 20:18:26 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b654
Jul 31 20:18:27 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b65c
Jul 31 20:18:27 pm5 corosync[19181]:   [TOTEM ] Retransmit List: b65d
Jul 31 20:18:32 pm5 corosync[19181]:   [TOTEM ] A new membership (4:104) was formed. Members left: 1 2 3
Jul 31 20:18:32 pm5 corosync[19181]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3
Jul 31 20:18:32 pm5 corosync[19181]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm5 corosync[19181]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm5 corosync[19181]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm5 corosync[19181]:   [CPG   ] downlist left_list: 3 received
Jul 31 20:18:32 pm5 pmxcfs[19016]: [dcdb] notice: members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm5 pmxcfs[19016]: [dcdb] notice: starting data syncronisation
Jul 31 20:18:32 pm5 pmxcfs[19016]: [status] notice: members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm5 pmxcfs[19016]: [status] notice: starting data syncronisation
Jul 31 20:18:32 pm5 corosync[19181]:   [QUORUM] Members[4]: 4 5 6 7
Jul 31 20:18:32 pm5 corosync[19181]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 31 20:18:32 pm5 pmxcfs[19016]: [dcdb] notice: received sync request (epoch 4/1232/00000006)
Jul 31 20:18:32 pm5 pmxcfs[19016]: [status] notice: received sync request (epoch 4/1232/00000006)
Jul 31 20:18:32 pm5 pmxcfs[19016]: [dcdb] notice: received all states
Jul 31 20:18:32 pm5 pmxcfs[19016]: [dcdb] notice: leader is 4/1232
Jul 31 20:18:32 pm5 pmxcfs[19016]: [dcdb] notice: synced members: 4/1232, 5/1431, 6/19016, 7/3312
Jul 31 20:18:32 pm5 pmxcfs[19016]: [dcdb] notice: all data is up to date
Jul 31 20:18:32 pm5 pmxcfs[19016]: [dcdb] notice: dfsm_deliver_queue: queue length 12
Jul 31 20:18:32 pm5 pmxcfs[19016]: [status] notice: received all states
Jul 31 20:18:32 pm5 pmxcfs[19016]: [status] notice: all data is up to date
Jul 31 20:18:32 pm5 pmxcfs[19016]: [status] notice: dfsm_deliver_queue: queue length 18
Jul 31 20:18:41 pm5 corosync[19181]:   [TOTEM ] A new membership (4:108) was formed. Members
Jul 31 20:18:46 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 10
Jul 31 20:18:47 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 20
Jul 31 20:18:48 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 30
Jul 31 20:18:49 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 40
Jul 31 20:18:49 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 10
Jul 31 20:18:49 pm5 corosync[19181]:   [TOTEM ] A new membership (4:116) was formed. Members
Jul 31 20:18:49 pm5 corosync[19181]:   [TOTEM ] A new membership (1:124) was formed. Members joined: 1 2 3
Jul 31 20:18:49 pm5 corosync[19181]:   [TOTEM ] Retransmit List: 1 2
Jul 31 20:18:50 pm5 corosync[19181]:   [TOTEM ] Retransmit List: 11
Jul 31 20:18:50 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 50
Jul 31 20:18:50 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 20
Jul 31 20:18:51 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 60
Jul 31 20:18:51 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 30
Jul 31 20:18:52 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 70
Jul 31 20:18:52 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 40
Jul 31 20:18:53 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 80
Jul 31 20:18:53 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 50
Jul 31 20:18:54 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 90
Jul 31 20:18:54 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 60
Jul 31 20:18:55 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 100
Jul 31 20:18:55 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retried 100 times
Jul 31 20:18:55 pm5 pmxcfs[19016]: [dcdb] crit: cpg_send_message failed: 6
Jul 31 20:18:55 pm5 pve-ha-lrm[20562]: lost lock 'ha_agent_pm5_lock - cfs lock update failed - Device or resource busy
Jul 31 20:18:55 pm5 pve-ha-lrm[20562]: status change active => lost_agent_lock
Jul 31 20:18:55 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 70
Jul 31 20:18:56 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 80
Jul 31 20:18:57 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 90
Jul 31 20:18:58 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 100
Jul 31 20:18:58 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retried 100 times
Jul 31 20:18:58 pm5 pmxcfs[19016]: [status] crit: cpg_send_message failed: 6
Jul 31 20:18:59 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 10
Jul 31 20:19:00 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 20
Jul 31 20:19:00 pm5 systemd[1]: Starting Proxmox VE replication runner...
Jul 31 20:19:01 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 30
Jul 31 20:19:02 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 40
Jul 31 20:19:03 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 50
Jul 31 20:19:04 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 60
Jul 31 20:19:05 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 70
Jul 31 20:19:06 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 80
Jul 31 20:19:07 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 90
Jul 31 20:19:08 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 100
Jul 31 20:19:08 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retried 100 times
Jul 31 20:19:08 pm5 pmxcfs[19016]: [status] crit: cpg_send_message failed: 6
Jul 31 20:19:08 pm5 pvesr[23134]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 31 20:19:09 pm5 pmxcfs[19016]: [dcdb] notice: cpg_send_message retry 10
Jul 31 20:19:09 pm5 pmxcfs[19016]: [status] notice: cpg_send_message retry 10

Maybe all these logs help.
 
We are seeing the same behaviour on a PVE 6 cluster that was upgraded from PVE 5. Not sure if the root cause is the same, but the symptom surely is. The logs seem to indicate pvesr hanging on one host or another right before the quorum loss, but we don't have any replication jobs configured, so maybe this too is just a symptom of a deeper root cause. Corosync needs to be restarted on most or all nodes before quorum is restored. We didn't see this issue with PVE 5.

We are using Ceph storage on these hosts. There is no indication of any issues on the Ceph side: we have rebuilt all OSDs involved since upgrading (due to the BlueFS WAL/journal overflow issue when upgrading to Nautilus) and Ceph has reported being healthy ever since.
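
For what it's worth, a sketch of restarting the stack on every node in one go (the node names below are placeholders, not from this cluster):

Code:
for n in node1 node2 node3; do
    ssh root@"$n" 'systemctl restart corosync pve-cluster'
done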

Right around the time this issue manifests, corosync repeatedly reports that the token has not been received, i.e.:

Code:
Aug  1 04:51:52 sanctuary corosync[369105]:   [TOTEM ] Token has not been received in 61 ms
Aug  1 04:51:55 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273076) was formed. Members left: 1
Aug  1 04:51:55 sanctuary corosync[369105]:   [TOTEM ] Failed to receive the leave message. failed: 1
Aug  1 04:51:58 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273080) was formed. Members
Aug  1 04:51:58 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 1 received
Aug  1 04:51:58 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 1 received
Aug  1 04:51:58 sanctuary pmxcfs[217081]: [dcdb] notice: members: 2/217081, 3/1330
Aug  1 04:51:58 sanctuary pmxcfs[217081]: [dcdb] notice: starting data syncronisation
Aug  1 04:51:59 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 10
Aug  1 04:52:00 sanctuary systemd[1]: Starting Proxmox VE replication runner...
Aug  1 04:52:00 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 20
Aug  1 04:52:00 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273084) was formed. Members
Aug  1 04:52:01 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 30
Aug  1 04:52:02 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 40
Aug  1 04:52:02 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273088) was formed. Members
Aug  1 04:52:02 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:03 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 10
Aug  1 04:52:03 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 50
Aug  1 04:52:04 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 20
Aug  1 04:52:04 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 60
Aug  1 04:52:05 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273092) was formed. Members
Aug  1 04:52:05 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:05 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 30
Aug  1 04:52:05 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 70
Aug  1 04:52:06 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 40
Aug  1 04:52:06 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 80
Aug  1 04:52:07 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273096) was formed. Members
Aug  1 04:52:07 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:07 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 50
Aug  1 04:52:07 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 90
Aug  1 04:52:08 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 60
Aug  1 04:52:08 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 100
Aug  1 04:52:08 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retried 100 times
Aug  1 04:52:08 sanctuary pmxcfs[217081]: [status] notice: members: 2/217081, 3/1330
Aug  1 04:52:08 sanctuary pmxcfs[217081]: [status] notice: starting data syncronisation
Aug  1 04:52:09 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273100) was formed. Members
Aug  1 04:52:09 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:09 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:09 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:09 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 70
Aug  1 04:52:09 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 10
Aug  1 04:52:10 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 80
Aug  1 04:52:10 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 20
Aug  1 04:52:11 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273104) was formed. Members
Aug  1 04:52:11 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 90
Aug  1 04:52:11 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 30
Aug  1 04:52:12 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 100
Aug  1 04:52:12 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retried 100 times
Aug  1 04:52:12 sanctuary pmxcfs[217081]: [status] crit: cpg_send_message failed: 6
Aug  1 04:52:12 sanctuary pve-firewall[1711]: firewall update time (9.326 seconds)
Aug  1 04:52:12 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 40
Aug  1 04:52:13 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:13 sanctuary corosync[369105]:   [TOTEM ] A new membership (2:273108) was formed. Members
Aug  1 04:52:13 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:13 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:13 sanctuary corosync[369105]:   [CPG   ] downlist left_list: 0 received
Aug  1 04:52:13 sanctuary corosync[369105]:   [QUORUM] Members[2]: 2 3
Aug  1 04:52:13 sanctuary corosync[369105]:   [MAIN  ] Completed service synchronization, ready to provide service.

From this point on, quorum is never reached again until corosync is restarted on multiple hosts. I also notice corosync reporting very strange path MTU discovery behaviour right before, during, and after this issue:

Code:
Aug  1 04:55:41 sanctuary corosync[369105]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 1 link 0 but the other node is not acknowledging packets of this size. 
Aug  1 04:55:41 sanctuary corosync[369105]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug  1 04:55:41 sanctuary corosync[369105]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1366 to 1350
Aug  1 04:55:41 sanctuary corosync[369105]:   [KNET  ] pmtud: Global data MTU changed to: 1350
Aug  1 04:56:11 sanctuary corosync[369105]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1350 to 1366
Aug  1 04:56:11 sanctuary corosync[369105]:   [KNET  ] pmtud: Global data MTU changed to: 1366
Aug  1 05:40:45 sanctuary corosync[369105]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 1 link 0 but the other node is not acknowledging packets of this size. 
Aug  1 05:40:45 sanctuary corosync[369105]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug  1 05:40:45 sanctuary corosync[369105]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1366 to 1350
Aug  1 05:40:45 sanctuary corosync[369105]:   [KNET  ] pmtud: Global data MTU changed to: 1350
Aug  1 05:41:15 sanctuary corosync[369105]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1350 to 1366
Aug  1 05:41:15 sanctuary corosync[369105]:   [KNET  ] pmtud: Global data MTU changed to: 1366

Of course, there is no real MTU change on this network. The hosts are directly connected through a single switch, jumbo frames haven't been configured yet, and everything is 1500 with no encapsulating protocols in effect.

Is there a way to just disable this userspace path MTU detection corosync tries to do? I suspect it's just plain broken.

Edit: I should also note that this seems to happen at roughly the same interval as in the original post, every 12-24 hours, but not at exactly the same time every morning. Possibly it's a fixed interval after the last corosync restart?
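
Regarding disabling the userspace path MTU probing: there does not seem to be a complete off switch, but corosync.conf(5) documents a totem option that controls how often kronosnet runs PMTUd; as a sketch (please verify the option against the man page shipped with your corosync 3 build):

Code:
totem {
  # ... existing options unchanged ...
  knet_pmtud_interval: 86400   # seconds between kronosnet PMTUd runs (default 30)
}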
 
BTW: I also have no replication configured, and the /etc/pve/file_replication.cfg file does not even exist, yet even on PVE 5 I sometimes get "pvesr[22667]: trying to acquire cfs lock 'file-replication_cfg' ...". It never hurt so far on PVE 5 (and the message does not come from the corosync log either), so this may not be connected.
 
*Maybe* this is related to this issue on knet with pmtud when using crypto: https://github.com/kronosnet/kronosnet/pull/242/commits. If so, it looks like a possible fix is already in the pipeline.
 
thanks!
 
@WesC could you try installing the following test build on all nodes, and restarting corosync afterwards?
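
(The concrete package comes from the referenced test build; as a generic sketch, installing a .deb by hand and restarting corosync might look like the following, with a placeholder filename:)

Code:
# placeholder filename; use the actual test build package
dpkg -i libknet1_*_amd64.deb
systemctl restart corosync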
Would this also be testable for PVE 5.4? Because libknet was not in the list of installed packages when upgrading PVE 5.4 to corosync3 ... ?
 
Would this also be testable for PVE 5.4? Because libknet was not in the list of installed packages when upgrading PVE 5.4 to corosync3 ... ?

It's not in the list of upgradable packages, but it gets pulled in when upgrading to Corosync 3.x because the corosync package depends on libknet1. I can make a test build for PVE 5.4 against the corosync-3 repo as well if you want, but I think waiting for results from PVE 6.x users is probably enough to triage this for now.
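
For anyone following along, a quick way to confirm on each node which libknet1 build actually got pulled in (plain Debian tooling, nothing PVE-specific assumed):

Code:
apt-cache policy libknet1                    # installed vs. candidate version
dpkg -s libknet1 | grep '^Version'           # what dpkg thinks is installed
apt-cache depends corosync | grep -i knet    # shows the libknet1 dependency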
 
Hi @fabian, sorry to report that this version still has issues. I don't see the PMTUD problem in the logs today, but one host still had trouble keeping quorum and reported lost tokens for about an hour (all of which might be a separate issue elsewhere), and then its kernel reported a crash in libknet.so:

Code:
Aug  3 02:22:39 scramjet corosync[1136337]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  3 02:24:01 scramjet corosync[1136337]:   [TOTEM ] Token has not been received in 1237 ms
Aug  3 02:24:01 scramjet corosync[1136337]:   [TOTEM ] A processor failed, forming new configuration.
Aug  3 02:24:01 scramjet corosync[1136337]:   [TOTEM ] A new membership (1:290496) was formed. Members
Aug  3 02:24:01 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:24:01 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:24:01 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:24:01 scramjet corosync[1136337]:   [QUORUM] Members[3]: 1 2 3
Aug  3 02:24:01 scramjet corosync[1136337]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  3 02:28:52 scramjet corosync[1136337]:   [TOTEM ] Token has not been received in 1237 ms
Aug  3 02:28:52 scramjet corosync[1136337]:   [TOTEM ] A processor failed, forming new configuration.
Aug  3 02:28:52 scramjet corosync[1136337]:   [TOTEM ] A new membership (1:290500) was formed. Members
Aug  3 02:28:52 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:28:52 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:28:52 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:28:52 scramjet corosync[1136337]:   [QUORUM] Members[3]: 1 2 3
Aug  3 02:28:52 scramjet corosync[1136337]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  3 02:37:28 scramjet corosync[1136337]:   [KNET  ] link: host: 3 link: 0 is down
Aug  3 02:37:28 scramjet corosync[1136337]:   [KNET  ] link: host: 2 link: 0 is down
Aug  3 02:37:28 scramjet corosync[1136337]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug  3 02:37:28 scramjet corosync[1136337]:   [KNET  ] host: host: 3 has no active links
Aug  3 02:37:28 scramjet corosync[1136337]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug  3 02:37:28 scramjet corosync[1136337]:   [KNET  ] host: host: 2 has no active links
Aug  3 02:37:29 scramjet corosync[1136337]:   [TOTEM ] Token has not been received in 63 ms
Aug  3 02:37:29 scramjet corosync[1136337]:   [KNET  ] rx: host: 3 link: 0 is up
Aug  3 02:37:29 scramjet corosync[1136337]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug  3 02:37:29 scramjet corosync[1136337]:   [KNET  ] rx: host: 2 link: 0 is up
Aug  3 02:37:29 scramjet corosync[1136337]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug  3 02:37:29 scramjet corosync[1136337]:   [TOTEM ] A new membership (1:290504) was formed. Members
Aug  3 02:37:29 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:37:29 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:37:29 scramjet corosync[1136337]:   [CPG   ] downlist left_list: 0 received
Aug  3 02:37:29 scramjet corosync[1136337]:   [QUORUM] Members[3]: 1 2 3
Aug  3 02:37:29 scramjet corosync[1136337]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  3 02:43:32 scramjet corosync[1136337]:   [KNET  ] link: host: 3 link: 0 is down
Aug  3 02:43:32 scramjet corosync[1136337]:   [KNET  ] link: host: 2 link: 0 is down
Aug  3 02:43:32 scramjet corosync[1136337]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug  3 02:43:32 scramjet corosync[1136337]:   [KNET  ] host: host: 3 has no active links
Aug  3 02:43:32 scramjet corosync[1136337]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug  3 02:43:32 scramjet corosync[1136337]:   [KNET  ] host: host: 2 has no active links
Aug  3 02:43:33 scramjet corosync[1136337]:   [TOTEM ] Token has not been received in 61 ms
Aug  3 02:43:33 scramjet corosync[1136337]:   [KNET  ] rx: host: 3 link: 0 is up
Aug  3 02:43:33 scramjet corosync[1136337]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug  3 02:49:29 scramjet corosync[1136337]:   [KNET  ] link: host: 3 link: 0 is down
Aug  3 02:49:29 scramjet corosync[1136337]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug  3 02:49:29 scramjet corosync[1136337]:   [KNET  ] host: host: 3 has no active links
Aug  3 02:49:30 scramjet corosync[1136337]:   [KNET  ] rx: host: 3 link: 0 is up
Aug  3 02:49:30 scramjet corosync[1136337]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)

Code:
Aug  3 02:49:31 scramjet kernel: [246745.376178] traps: corosync[1136352] trap divide error ip:xxxxxxxx08e6 sp:xxxxxxxx0a50 error:0 in libknet.so.1.2.0[xxxxxxxx5000+13000]
Aug  3 02:49:31 scramjet systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
Aug  3 02:49:31 scramjet systemd[1]: corosync.service: Failed with result 'signal'.

edit: The other hosts also reported lost tokens, retransmits, and repeatedly formed new membership lists during this time; however, they did regain quorum. The specific host above where the FPE was seen is an AMD EPYC system while the other two are Intel, so this crash might be due to some lower-level platform issue in gcc/glibc/etc. and maybe not libknet's fault per se. I'll leave this cluster running this version of libknet for the rest of the weekend to see if it happens again. I'd also like to note that we didn't see any issues like this, or the PMTUD one, when running the cluster on the previous libknet version but with secauth: off. (I assume that's not an ideal mode to run corosync in for a multi-tenant environment; however, this cluster is internal at our office and single-tenant.)
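
(For context, both the legacy secauth shorthand and the newer explicit crypto options live in the totem section of corosync.conf. The excerpt below is only illustrative; the cipher and hash values are common choices, not necessarily what this cluster uses.)

Code:
totem {
  # legacy shorthand previously used on this cluster:
  # secauth: off
  # corosync 3 style, crypto enabled explicitly:
  crypto_cipher: aes256
  crypto_hash: sha256
}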
 
Code:
Aug  4 00:18:16 scramjet corosync[1472229]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 2 link 0 but the other node is not acknowledging packets of this size.
Aug  4 00:18:16 scramjet corosync[1472229]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug  4 00:18:16 scramjet corosync[1472229]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 1366 to 1350
Aug  4 00:18:16 scramjet corosync[1472229]:   [KNET  ] pmtud: Global data MTU changed to: 1350
Aug  4 00:18:46 scramjet corosync[1472229]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 1350 to 1366
Aug  4 00:18:46 scramjet corosync[1472229]:   [KNET  ] pmtud: Global data MTU changed to: 1366
Aug  4 00:30:07 scramjet corosync[1472229]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 3 link 0 but the other node is not acknowledging packets of this size.
Aug  4 00:30:07 scramjet corosync[1472229]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug  4 00:30:07 scramjet corosync[1472229]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 1366 to 1350
Aug  4 00:30:07 scramjet corosync[1472229]:   [KNET  ] pmtud: Global data MTU changed to: 1350
Aug  4 00:30:29 scramjet corosync[1472229]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 1350 to 1366
Aug  4 00:30:29 scramjet corosync[1472229]:   [KNET  ] pmtud: Global data MTU changed to: 1366
Aug  4 01:21:35 scramjet corosync[1472229]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 2 link 0 but the other node is not acknowledging packets of this size.
Aug  4 01:21:35 scramjet corosync[1472229]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug  4 01:21:35 scramjet corosync[1472229]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 1366 to 1350
Aug  4 01:21:35 scramjet corosync[1472229]:   [KNET  ] pmtud: Global data MTU changed to: 1350
Aug  4 01:22:05 scramjet corosync[1472229]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 1350 to 1366
Aug  4 01:22:05 scramjet corosync[1472229]:   [KNET  ] pmtud: Global data MTU changed to: 1366
Aug  4 03:42:11 scramjet corosync[1472229]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 3 link 0 but the other node is not acknowledging packets of this size.
Aug  4 03:42:11 scramjet corosync[1472229]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug  4 03:42:20 scramjet corosync[1472229]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 1350 to 1366
Aug  4 03:42:20 scramjet corosync[1472229]:   [KNET  ] pmtud: Global data MTU changed to: 1366

I double-checked: libknet is 1.10-pve2~test1 on all hosts. The cluster lost quorum again this morning, and corosync needed to be restarted on all 3 hosts to regain it.
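
For reference, the per-node recovery steps and the subsequent quorum checks, roughly (standard PVE service names assumed):

Code:
systemctl restart corosync        # on every affected node
systemctl restart pve-cluster     # pmxcfs; only needed if /etc/pve stays read-only
pvecm status                      # PVE's view of membership and quorum
corosync-quorumtool -s            # corosync's own quorum summary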
 
This morning the cluster is green, with no hosts marked as offline. I hope this means the specific issue with knet PMTUD and crypto has been resolved, and that the floating-point exception I saw over the weekend was an anomaly. When I get to the office this morning I'll dig through the logs for any signs of trouble.

Do you want me to keep running this version of libknet for a few more days before making a conclusion about it?

Thanks
 
@WesC thanks for taking the time to test the packages. I am one of the knet upstream maintainers, and we are working very closely with @fabian to solve these issues.

Could you please collect more logs around the libknet crash reported by the kernel? A few dozen lines above the crash would be very useful.
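
Something along these lines should do, assuming the messages are still in the journal or syslog on that host (the timestamps are only an illustrative window around the crash):

Code:
# everything corosync logged in a window around the crash
journalctl -u corosync --since "2019-08-03 01:30" --until "2019-08-03 03:00"
# or, straight from syslog: the crash line plus ~60 lines of context before it
grep -B 60 "trap divide error" /var/log/syslog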

At the moment we have isolated two problems with PMTUd:

1) The MTU calculation when using crypto is/was broken (pull request 242, which you already pointed out). I would not exclude that the trap error is caused by a combination of #1 and #2.
2) When using crypto, the timeout while waiting for the other node's reply is too short (that is what causes those messages about the MTU changing all the time).

We are currently waiting for test results around #1 (fix is ready) and working on #2. #2 is a bit more challenging in general.

As soon as we have a stable fix, we will do a new release for all of you guys.

Thanks for taking the time to report these issues and for being so cooperative in testing potential fixes.

Cheers
Fabio
 