Failed to start Proxmox VE replication runner

jompsi

Active Member
Apr 15, 2013
Hi all

I have a three-node cluster. In the web GUI of node1 and node2, node3 is shown as offline, while the web GUI of node3 shows node1 and node2 as offline.

daemon.log of node3:
Code:
Jan  8 11:49:00 drax systemd[1]: Starting Proxmox VE replication runner...
Jan  8 11:49:02 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:03 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:04 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:05 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:06 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:07 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:08 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:09 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:10 drax pvesr[12250]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  8 11:49:11 drax pvesr[12250]: error with cfs lock 'file-replication_cfg': no quorum!
Jan  8 11:49:11 drax systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan  8 11:49:11 drax systemd[1]: Failed to start Proxmox VE replication runner.
Jan  8 11:49:11 drax systemd[1]: pvesr.service: Unit entered failed state.
Jan  8 11:49:11 drax systemd[1]: pvesr.service: Failed with result 'exit-code'.
This appears every minute. Node1 and node2 have no errors.

I have read that this could be a multicast (IGMP) problem, but I don't think that is the case here.

node3 - omping:
Code:
omping -c 10000 -i 0.001 -F -q 10.200.1.20 10.200.1.21 10.200.1.22

10.200.1.20 : waiting for response msg
10.200.1.21 : waiting for response msg
10.200.1.20 : joined (S,G) = (*, 232.43.211.234), pinging
10.200.1.21 : joined (S,G) = (*, 232.43.211.234), pinging
10.200.1.20 : given amount of query messages was sent
10.200.1.21 : waiting for response msg
10.200.1.21 : server told us to stop

10.200.1.20 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.087/0.149/1.495/0.045
10.200.1.20 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.087/0.248/6.404/0.750
10.200.1.21 :   unicast, xmt/rcv/%loss = 9029/9029/0%, min/avg/max/std-dev = 0.076/0.112/1.469/0.033
10.200.1.21 : multicast, xmt/rcv/%loss = 9029/9029/0%, min/avg/max/std-dev = 0.087/0.121/2.116/0.036

node3 - pvecm status:
Code:
pvecm status
Quorum information
------------------
Date:             Tue Jan  8 11:56:56 2019
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1/1628
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.200.1.20
0x00000002          1 10.200.1.21
0x00000003          1 10.200.1.22 (local)

corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: antman
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.200.1.21
  }
  node {
    name: drax
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.200.1.22
  }
  node {
    name: rocket
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.200.1.20
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: proxmox-cluster
  config_version: 3
  interface {
    bindnetaddr: 10.200.1.20
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
On all three nodes the config_version is 3.

I am not using HA.

pveversion
node1 - pve-manager/5.2-10/6f892b40 (running kernel: 4.15.18-7-pve)
node2 - pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-9-pve)
node3 - pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-9-pve)

I really have no clue what the problem could be. Has anybody seen this problem before? I would be glad if someone could help or point me in the right direction.

Best regards
Joel
 
Hi Stoiko

I have no storage replication configured. I only have some NFS storage for backing up the VMs. All the VMs run on local storage on the nodes.

storage.cfg:
Code:
dir: local
        path /var/lib/vz
        content iso,backup,vztmpl

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

dir: rocket-local
        path /mnt/rocket-local
        content images
        shared 0

dir: antman-local
        path /mnt/antman-local
        content images
        shared 0

nfs: cube18-bkps
        export /volume1/Virtualization/Backups
        path /mnt/pve/cube18-bkps
        server 10.200.1.28
        content backup
        maxfiles 2
        options vers=3

nfs: ISO-cube18
        export /volume1/Virtualization/ISO
        path /mnt/pve/ISO-cube18
        server cube18
        content iso
        options vers=3

nfs: BackupServer
        export /volume1/SysBackups/VM
        path /mnt/pve/BackupServer
        server backupserver
        content backup
        maxfiles 5
        options vers=3

nfs: cube18-vms
        export /volume1/Virtualization/VMs
        path /mnt/pve/cube18-vms
        server cube18.comp.local
        content images
        options vers=3

There seems to be almost no CPU usage (this node has only two VMs):
Code:
top
%Cpu0  :  0.3 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu2  :  0.7 us,  0.3 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.7 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  1.3 us,  0.3 sy,  0.0 ni, 98.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  1.3 us,  0.3 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  1.0 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 :  0.3 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  1.0 us,  0.7 sy,  0.0 ni, 98.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

I think I have been on that Bugzilla page once, but honestly, I don't know what I am supposed to do. There is no replication.cfg in my /etc/pve.

Thanks and kind regards
Joel
 
hmm,

Is the cluster status ok without any interruptions?
* check the journal for log entries from `pve-cluster.service` (pmxcfs) and `corosync.service`:
`journalctl -r -u corosync -u pve-cluster` (`-r` reverses the order (newest first) and `-u` selects the units)

* can you write in `/etc/pve/`? (e.g. `echo "test" > /etc/pve/testfile ; cat /etc/pve/testfile` on node3)
 
Good morning Stoiko

I have done everything you suggested; see the results below. During these steps I realized that the journal for pve-cluster on node3 only had entries from 23.12.2018. So I checked the service, which was running, but I then restarted it anyway:
Code:
root@drax:/etc# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2018-12-17 15:12:15 CET; 3 weeks 1 days ago
 Main PID: 1678 (pmxcfs)
    Tasks: 7 (limit: 9830)
   Memory: 55.3M
      CPU: 23min 18.139s
   CGroup: /system.slice/pve-cluster.service
           └─1678 /usr/bin/pmxcfs

Dec 23 02:48:41 drax pmxcfs[1678]: [dcdb] crit: cpg_initialize failed: 2
Dec 23 02:48:41 drax pmxcfs[1678]: [status] crit: cpg_initialize failed: 2
Dec 23 02:48:47 drax pmxcfs[1678]: [quorum] crit: quorum_initialize failed: 2
Dec 23 02:48:47 drax pmxcfs[1678]: [confdb] crit: cmap_initialize failed: 2
Dec 23 02:48:47 drax pmxcfs[1678]: [dcdb] crit: cpg_initialize failed: 2
Dec 23 02:48:47 drax pmxcfs[1678]: [status] crit: cpg_initialize failed: 2
Dec 23 02:48:53 drax pmxcfs[1678]: [quorum] crit: quorum_initialize failed: 2
Dec 23 02:48:53 drax pmxcfs[1678]: [confdb] crit: cmap_initialize failed: 2
Dec 23 02:48:53 drax pmxcfs[1678]: [dcdb] crit: cpg_initialize failed: 2
Dec 23 02:48:53 drax pmxcfs[1678]: [status] crit: cpg_initialize failed: 2
root@drax:/etc# service pve-cluster restart
root@drax:/etc# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-01-09 08:57:10 CET; 3s ago
  Process: 48534 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
  Process: 48510 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
 Main PID: 48520 (pmxcfs)
    Tasks: 6 (limit: 9830)
   Memory: 11.3M
      CPU: 777ms
   CGroup: /system.slice/pve-cluster.service
           └─48520 /usr/bin/pmxcfs

Jan 09 08:57:08 drax pmxcfs[48520]: [status] notice: received sync request (epoch 1/1905/00000025)
Jan 09 08:57:08 drax pmxcfs[48520]: [dcdb] notice: received all states
Jan 09 08:57:08 drax pmxcfs[48520]: [dcdb] notice: leader is 1/1905
Jan 09 08:57:08 drax pmxcfs[48520]: [dcdb] notice: synced members: 1/1905, 2/1786
Jan 09 08:57:08 drax pmxcfs[48520]: [dcdb] notice: waiting for updates from leader
Jan 09 08:57:08 drax pmxcfs[48520]: [status] notice: received all states
Jan 09 08:57:08 drax pmxcfs[48520]: [status] notice: all data is up to date
Jan 09 08:57:08 drax pmxcfs[48520]: [dcdb] notice: update complete - trying to commit (got 14 inode updates)
Jan 09 08:57:08 drax pmxcfs[48520]: [dcdb] notice: all data is up to date
Jan 09 08:57:10 drax systemd[1]: Started The Proxmox VE cluster filesystem.

And now in the web GUI all nodes are online again. Do you have an idea what caused this?

Thank you very much and kind regards
Joel

----- Steps you suggested -----
I have the feeling the cluster status is ok.

node3 - journalctl -r -u pve-cluster
Code:
-- Logs begin at Mon 2018-12-17 15:12:07 CET, end at Wed 2019-01-09 08:45:10 CET. --
Dec 23 02:48:53 drax pmxcfs[1678]: [status] crit: cpg_initialize failed: 2
Dec 23 02:48:53 drax pmxcfs[1678]: [dcdb] crit: cpg_initialize failed: 2
Dec 23 02:48:53 drax pmxcfs[1678]: [confdb] crit: cmap_initialize failed: 2
Dec 23 02:48:53 drax pmxcfs[1678]: [quorum] crit: quorum_initialize failed: 2
Dec 23 02:48:47 drax pmxcfs[1678]: [status] crit: cpg_initialize failed: 2
Dec 23 02:48:47 drax pmxcfs[1678]: [dcdb] crit: cpg_initialize failed: 2
Dec 23 02:48:47 drax pmxcfs[1678]: [confdb] crit: cmap_initialize failed: 2
Dec 23 02:48:47 drax pmxcfs[1678]: [quorum] crit: quorum_initialize failed: 2
Dec 23 02:48:41 drax pmxcfs[1678]: [status] crit: cpg_initialize failed: 2
Dec 23 02:48:41 drax pmxcfs[1678]: [dcdb] crit: cpg_initialize failed: 2
Dec 23 02:48:41 drax pmxcfs[1678]: [confdb] crit: cmap_initialize failed: 2
Dec 23 02:48:41 drax pmxcfs[1678]: [quorum] crit: quorum_initialize failed: 2
Dec 23 02:48:35 drax pmxcfs[1678]: [status] crit: cpg_initialize failed: 2
Dec 23 02:48:35 drax pmxcfs[1678]: [dcdb] crit: cpg_initialize failed: 2
Dec 23 02:48:35 drax pmxcfs[1678]: [confdb] crit: cmap_initialize failed: 2
Dec 23 02:48:35 drax pmxcfs[1678]: [quorum] crit: quorum_initialize failed: 2
Dec 23 02:48:29 drax pmxcfs[1678]: [status] crit: cpg_initialize failed: 2
Dec 23 02:48:29 drax pmxcfs[1678]: [dcdb] crit: cpg_initialize failed: 2
It seems strange to me that the newest entries are from Dec 23.

node2 - journalctl -r -u pve-cluster
Code:
-- Logs begin at Mon 2018-12-17 20:29:34 CET, end at Wed 2019-01-09 08:47:01 CET. --
Jan 09 08:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 09 07:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 09 06:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 09 05:57:13 antman pmxcfs[1786]: [status] notice: received log
Jan 09 05:57:08 antman pmxcfs[1786]: [status] notice: received log
Jan 09 05:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 09 04:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 09 03:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 09 02:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 09 01:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 09 01:05:51 antman pmxcfs[1786]: [status] notice: received log
Jan 09 01:00:02 antman pmxcfs[1786]: [status] notice: received log
Jan 09 01:00:02 antman pmxcfs[1786]: [status] notice: received log
Jan 09 00:17:48 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 23:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 22:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 21:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 20:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 19:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 18:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 17:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 16:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 15:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 14:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 13:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 12:17:47 antman pmxcfs[1786]: [dcdb] notice: data verification successful
Jan 08 11:41:10 antman pmxcfs[1786]: [status] notice: all data is up to date
Jan 08 11:41:10 antman pmxcfs[1786]: [status] notice: received all states
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: all data is up to date
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: update complete - trying to commit (got 2 inode updates)
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: waiting for updates from leader
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: synced members: 1/1905
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: leader is 1/1905
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: received all states
Jan 08 11:41:10 antman pmxcfs[1786]: [status] notice: received sync request (epoch 1/1905/00000024)
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: received sync request (epoch 1/1905/00000024)
Jan 08 11:41:10 antman pmxcfs[1786]: [status] notice: starting data syncronisation
Jan 08 11:41:10 antman pmxcfs[1786]: [status] notice: members: 1/1905, 2/1786
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: starting data syncronisation
Jan 08 11:41:10 antman pmxcfs[1786]: [dcdb] notice: members: 1/1905, 2/1786
Jan 08 11:41:10 antman pmxcfs[1786]: [status] notice: node has quorum
Jan 08 11:41:10 antman pmxcfs[1786]: [status] notice: update cluster info (cluster name  proxmox-cluster, version = 3)
Jan 08 11:41:04 antman pmxcfs[1786]: [status] crit: can't initialize service
Jan 08 11:41:04 antman pmxcfs[1786]: [status] crit: cpg_initialize failed: 2
Jan 08 11:41:04 antman pmxcfs[1786]: [status] notice: start cluster connection
Jan 08 11:41:04 antman pmxcfs[1786]: [dcdb] crit: can't initialize service
Jan 08 11:41:04 antman pmxcfs[1786]: [dcdb] crit: cpg_initialize failed: 2
Jan 08 11:41:04 antman pmxcfs[1786]: [dcdb] notice: start cluster connection
Jan 08 11:41:04 antman pmxcfs[1786]: [confdb] crit: can't initialize service
Jan 08 11:41:04 antman pmxcfs[1786]: [confdb] crit: cmap_initialize failed: 2
Jan 08 11:41:04 antman pmxcfs[1786]: [quorum] crit: can't initialize service
Jan 08 11:41:04 antman pmxcfs[1786]: [quorum] crit: quorum_initialize failed: 2
Jan 08 11:41:04 antman pmxcfs[1786]: [status] notice: node lost quorum
Jan 08 11:41:04 antman pmxcfs[1786]: [quorum] crit: quorum_dispatch failed: 2
Jan 08 11:41:04 antman pmxcfs[1786]: [dcdb] crit: cpg_leave failed: 2
Jan 08 11:41:04 antman pmxcfs[1786]: [dcdb] crit: cpg_dispatch failed: 2

node1 - journalctl -r -u pve-cluster
Code:
-- Logs begin at Fri 2018-11-09 11:17:43 CET, end at Wed 2019-01-09 08:41:01 CET. --
Jan 09 08:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 09 07:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 09 06:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 09 05:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 09 04:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 09 03:41:47 rocket pmxcfs[1905]: [status] notice: received log
Jan 09 03:41:38 rocket pmxcfs[1905]: [status] notice: received log
Jan 09 03:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 09 02:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 09 01:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 09 00:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 23:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 22:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 21:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 20:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 19:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 18:57:37 rocket pmxcfs[1905]: [status] notice: received log
Jan 08 18:42:42 rocket pmxcfs[1905]: [status] notice: received log
Jan 08 18:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 17:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 16:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 15:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 14:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 13:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 13:04:45 rocket pmxcfs[1905]: [status] notice: received log
Jan 08 12:49:45 rocket pmxcfs[1905]: [status] notice: received log
Jan 08 12:34:44 rocket pmxcfs[1905]: [status] notice: received log
Jan 08 12:19:43 rocket pmxcfs[1905]: [status] notice: received log
Jan 08 12:17:47 rocket pmxcfs[1905]: [dcdb] notice: data verification successful
Jan 08 12:04:42 rocket pmxcfs[1905]: [status] notice: received log
Jan 08 11:49:41 rocket pmxcfs[1905]: [status] notice: received log
Jan 08 11:41:10 rocket pmxcfs[1905]: [status] notice: all data is up to date
Jan 08 11:41:10 rocket pmxcfs[1905]: [status] notice: received all states
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: all data is up to date
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: sent all (2) updates
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: start sending inode updates
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: synced members: 1/1905
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: leader is 1/1905
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: received all states
Jan 08 11:41:10 rocket pmxcfs[1905]: [status] notice: received sync request (epoch 1/1905/00000024)
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: received sync request (epoch 1/1905/00000024)
Jan 08 11:41:10 rocket pmxcfs[1905]: [status] notice: starting data syncronisation
Jan 08 11:41:10 rocket pmxcfs[1905]: [status] notice: members: 1/1905, 2/1786
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: starting data syncronisation
Jan 08 11:41:10 rocket pmxcfs[1905]: [dcdb] notice: members: 1/1905, 2/1786
Jan 08 11:41:04 rocket pmxcfs[1905]: [status] notice: members: 1/1905
Jan 08 11:41:04 rocket pmxcfs[1905]: [dcdb] notice: members: 1/1905
Jan 08 11:41:02 rocket pmxcfs[1905]: [status] notice: all data is up to date
Jan 08 11:41:02 rocket pmxcfs[1905]: [status] notice: received all states
Jan 08 11:41:02 rocket pmxcfs[1905]: [dcdb] notice: all data is up to date
Jan 08 11:41:02 rocket pmxcfs[1905]: [dcdb] notice: update complete - trying to commit (got 1 inode updates)
Jan 08 11:41:02 rocket pmxcfs[1905]: [dcdb] notice: waiting for updates from leader
Jan 08 11:41:02 rocket pmxcfs[1905]: [dcdb] notice: synced members: 2/1786
Jan 08 11:41:02 rocket pmxcfs[1905]: [dcdb] notice: leader is 2/1786
Jan 08 11:41:02 rocket pmxcfs[1905]: [dcdb] notice: received all states
Jan 08 11:41:02 rocket pmxcfs[1905]: [status] notice: received sync request (epoch 1/1905/00000022)

node3 - journalctl -r -u corosync
Code:
-- Logs begin at Mon 2018-12-17 15:12:07 CET, end at Wed 2019-01-09 08:46:10 CET. --
Jan 09 08:35:36 drax corosync[10596]:  [TOTEM ] Retransmit List: 48191
Jan 09 08:35:36 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 48191
Jan 09 08:31:11 drax corosync[10596]:  [TOTEM ] Retransmit List: 47d8e
Jan 09 08:31:11 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 47d8e
Jan 09 06:26:46 drax corosync[10596]:  [TOTEM ] Retransmit List: 40b33
Jan 09 06:26:46 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 40b33
Jan 09 06:26:41 drax corosync[10596]:  [TOTEM ] Retransmit List: 40b2f
Jan 09 06:26:41 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 40b2f
Jan 09 06:25:36 drax corosync[10596]:  [TOTEM ] Retransmit List: 40a34 40a35
Jan 09 06:25:36 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 40a34 40a35
Jan 09 03:00:16 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 34d9f 34da0 34da1 34da2
Jan 09 03:00:16 drax corosync[10596]:  [TOTEM ] Retransmit List: 34d9f 34da0 34da1 34da2
Jan 09 03:00:16 drax corosync[10596]:  [TOTEM ] Retransmit List: 34d9f 34da0 34da1 34da2
Jan 09 03:00:16 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 34d9f 34da0 34da1 34da2
Jan 08 21:17:46 drax corosync[10596]:  [TOTEM ] Retransmit List: 21302 21303 21304
Jan 08 21:17:46 drax corosync[10596]:  [TOTEM ] Retransmit List: 21301
Jan 08 21:17:46 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 21302 21303 21304
Jan 08 21:17:46 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 21301
Jan 08 20:43:46 drax corosync[10596]:  [TOTEM ] Retransmit List: 1f3bd 1f3be 1f3bf 1f3c0
Jan 08 20:43:46 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 1f3bd 1f3be 1f3bf 1f3c0
Jan 08 20:40:26 drax corosync[10596]:  [TOTEM ] Retransmit List: 1f0a6 1f0a7
Jan 08 20:40:26 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 1f0a6 1f0a7
Jan 08 18:12:46 drax corosync[10596]:  [TOTEM ] Retransmit List: 168f9 168fa 168fb 168fc
Jan 08 18:12:46 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 168f9 168fa 168fb 168fc
node1 looks more or less the same.

But node2 differs here.
node2 - journalctl -r -u corosync
Code:
-- Logs begin at Mon 2018-12-17 20:29:34 CET, end at Wed 2019-01-09 08:48:01 CET. --
Jan 08 11:41:04 antman corosync[34324]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 08 11:41:04 antman corosync[34324]:  [QUORUM] Members[3]: 1 2 3
Jan 08 11:41:04 antman corosync[34324]:  [QUORUM] This node is within the primary component and will provide service.
Jan 08 11:41:04 antman corosync[34324]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 08 11:41:04 antman corosync[34324]: notice  [QUORUM] Members[3]: 1 2 3
Jan 08 11:41:04 antman corosync[34324]: notice  [QUORUM] This node is within the primary component and will provide service.
Jan 08 11:41:04 antman corosync[34324]:  [CPG   ] downlist left_list: 0 received
Jan 08 11:41:04 antman corosync[34324]: warning [CPG   ] downlist left_list: 0 received
Jan 08 11:41:04 antman corosync[34324]: warning [CPG   ] downlist left_list: 0 received
Jan 08 11:41:04 antman corosync[34324]:  [CPG   ] downlist left_list: 0 received
Jan 08 11:41:04 antman corosync[34324]:  [CPG   ] downlist left_list: 0 received
Jan 08 11:41:04 antman corosync[34324]: warning [CPG   ] downlist left_list: 0 received
Jan 08 11:41:04 antman corosync[34324]:  [TOTEM ] A new membership (10.200.1.20:1628) was formed. Members joined: 1 3
Jan 08 11:41:04 antman corosync[34324]: notice  [TOTEM ] A new membership (10.200.1.20:1628) was formed. Members joined: 1 3
Jan 08 11:41:04 antman corosync[34324]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 08 11:41:04 antman corosync[34324]:  [QUORUM] Members[1]: 2
Jan 08 11:41:04 antman corosync[34324]:  [CPG   ] downlist left_list: 0 received
Jan 08 11:41:04 antman corosync[34324]:  [TOTEM ] A new membership (10.200.1.21:1624) was formed. Members joined: 2
Jan 08 11:41:04 antman corosync[34324]:  [QB    ] server name: quorum
Jan 08 11:41:04 antman corosync[34324]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 08 11:41:04 antman corosync[34324]:  [QB    ] server name: votequorum
Jan 08 11:41:04 antman corosync[34324]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 08 11:41:04 antman corosync[34324]:  [QUORUM] Using quorum provider corosync_votequorum
Jan 08 11:41:04 antman corosync[34324]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jan 08 11:41:04 antman corosync[34324]:  [WD    ] no resources configured.
Jan 08 11:41:04 antman corosync[34324]:  [WD    ] resource memory_used missing a recovery key.
Jan 08 11:41:04 antman corosync[34324]:  [WD    ] resource load_15min missing a recovery key.
Jan 08 11:41:04 antman corosync[34324]:  [WD    ] Watchdog not enabled by configuration
Jan 08 11:41:04 antman corosync[34324]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jan 08 11:41:04 antman corosync[34324]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jan 08 11:41:04 antman corosync[34324]:  [QB    ] server name: cpg
Jan 08 11:41:04 antman corosync[34324]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jan 08 11:41:04 antman corosync[34324]:  [QB    ] server name: cfg
Jan 08 11:41:04 antman corosync[34324]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Jan 08 11:41:04 antman corosync[34324]:  [QB    ] server name: cmap
Jan 08 11:41:04 antman corosync[34324]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 08 11:41:04 antman corosync[34324]: notice  [QUORUM] Members[1]: 2
Jan 08 11:41:04 antman corosync[34324]: warning [CPG   ] downlist left_list: 0 received
Jan 08 11:41:04 antman systemd[1]: Started Corosync Cluster Engine.
Jan 08 11:41:04 antman corosync[34324]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jan 08 11:41:04 antman corosync[34324]: notice  [TOTEM ] A new membership (10.200.1.21:1624) was formed. Members joined: 2
Jan 08 11:41:04 antman corosync[34324]: info    [QB    ] server name: quorum
Jan 08 11:41:04 antman corosync[34324]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 08 11:41:04 antman corosync[34324]: info    [QB    ] server name: votequorum
Jan 08 11:41:04 antman corosync[34324]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 08 11:41:04 antman corosync[34324]: notice  [QUORUM] Using quorum provider corosync_votequorum
Jan 08 11:41:04 antman corosync[34324]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jan 08 11:41:04 antman corosync[34324]: info    [WD    ] no resources configured.
Jan 08 11:41:04 antman corosync[34324]: warning [WD    ] resource memory_used missing a recovery key.
Jan 08 11:41:04 antman corosync[34324]: warning [WD    ] resource load_15min missing a recovery key.
Jan 08 11:41:04 antman corosync[34324]: warning [WD    ] Watchdog not enabled by configuration
Jan 08 11:41:04 antman corosync[34324]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jan 08 11:41:04 antman corosync[34324]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jan 08 11:41:04 antman corosync[34324]: info    [QB    ] server name: cpg
Jan 08 11:41:04 antman corosync[34324]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jan 08 11:41:04 antman corosync[34324]: info    [QB    ] server name: cfg

If I try to write a file to /etc/pve on node3, I get the following error:
Code:
root@drax:/etc/pve# echo "test" > /etc/pve/testfile
-bash: /etc/pve/testfile: Permission denied
It works on node1 and node2.
 
Jan 08 20:43:46 drax corosync[10596]:  [TOTEM ] Retransmit List: 1f3bd 1f3be 1f3bf 1f3c0
Jan 08 20:43:46 drax corosync[10596]: notice  [TOTEM ] Retransmit List: 1f3bd 1f3be 1f3bf 1f3c0
These log lines point to cluster-network problems (despite omping working...).

* please try the longer running omping test from our documentation: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
* If the corosync network is shared with other functions (storage, backup, VM internet access, NFS, Ceph, ...), it is not uncommon that load in one of these components causes corosync latency to go up, which yields retransmits and loss of quorum. Please consider putting corosync on a ring of its own.
 
I have executed the longer-running omping, which shows 0% loss:
Code:
omping -c 600 -i 1 -q 10.200.1.20 10.200.1.21 10.200.1.22
10.200.1.20 : waiting for response msg
10.200.1.21 : waiting for response msg
10.200.1.20 : joined (S,G) = (*, 232.43.211.234), pinging
10.200.1.21 : joined (S,G) = (*, 232.43.211.234), pinging
10.200.1.20 : given amount of query messages was sent
10.200.1.21 : given amount of query messages was sent

10.200.1.20 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.112/0.285/3.998/0.265
10.200.1.20 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.114/0.287/3.536/0.240
10.200.1.21 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.098/0.249/1.747/0.101
10.200.1.21 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.104/0.251/1.749/0.094

All nodes have four 1 Gb network interfaces, which are configured as follows:
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto bond0
iface bond0 inet manual
        slaves eno1 eno2 eno3 eno4
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address  10.200.1.22
        netmask  255.255.255.0
        gateway  10.200.1.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0

Should I consider creating a VLAN for only the corosync traffic? Is it possible to do this now? Because it is a production environment :/

Regarding the network configuration, I once started this thread:
https://forum.proxmox.com/threads/cluster-creation-no-bond-supported.47217/#post-222677
 
Should I consider creating a VLAN for only the corosync traffic? Is it possible to do this now? Because it is a production environment :/
* VLAN separation will probably not help: if the network is loaded and the latency is high, the separately VLAN-tagged packets still go through the same physical line (unless you configure some kind of QoS and prioritization, which you could do for the corosync traffic).
* If you have one additional interface on each server, you could add a second corosync ring (check the cluster documentation I posted above).
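For illustration, a second ring might look roughly like the fragment below. This is only a sketch: the 10.200.2.0/24 subnet, the node entry, and the addresses are made-up examples; the real file lives at /etc/pve/corosync.conf, needs an entry per node, and its config_version must be bumped on every change (see the Proxmox cluster documentation before touching it on a production cluster).

```
totem {
  version: 2
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.200.1.0
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.200.2.0   # example: dedicated corosync subnet
  }
}

nodelist {
  node {
    name: drax
    ring0_addr: 10.200.1.22
    ring1_addr: 10.200.2.22   # example address on the dedicated NIC
  }
  # ... one node entry per cluster member ...
}
```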

Hm, a 4-port LACP bond: it could happen that the corosync packets get sent out on a different NIC than the omping packets (I just checked: omping and corosync use different multicast addresses), and with LACP and bond_xmit_hash_policy layer2+3, the IP information is part of the hash.
* Maybe check whether you see errors on one of the bond interfaces (eno1, eno2, eno3, eno4) - "ip -details -statistics link" should provide some counters; ethtool can also help.
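To see why two multicast flows can land on different slaves, here is a simplified sketch of the layer2+3 transmit hash, loosely following the Linux bonding documentation (XOR of MAC bytes and EtherType, folded with the IPv4 addresses, reduced modulo the slave count). The omping group 232.43.211.234 and the source MAC/IP are from this thread; the corosync group 239.192.1.1 is a made-up example (the real one is in your corosync.conf).

```python
def bond_xmit_hash_layer23(src_mac_last, dst_mac_last, eth_proto,
                           src_ip, dst_ip, n_slaves):
    """Simplified layer2+3 xmit hash as sketched in the kernel bonding docs."""
    h = src_mac_last ^ dst_mac_last ^ eth_proto  # layer-2 part
    h ^= src_ip ^ dst_ip                         # fold in IPv4 addresses
    h ^= h >> 16                                 # fold high bits down
    h ^= h >> 8
    return h % n_slaves                          # slave index in the bond

def ip2int(addr):
    p = [int(x) for x in addr.split(".")]
    return (p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3]

src = ip2int("10.200.1.22")
# The last MAC byte of an IPv4 multicast frame equals the last octet of the
# group address (01:00:5e:..), so 232.43.211.234 -> 0xea, 239.192.1.1 -> 0x01.
omping   = bond_xmit_hash_layer23(0xd8, 0xea, 0x0800, src, ip2int("232.43.211.234"), 4)
corosync = bond_xmit_hash_layer23(0xd8, 0x01, 0x0800, src, ip2int("239.192.1.1"), 4)
print(omping, corosync)  # the two flows can map to different physical NICs
```

So a clean omping result does not prove that the corosync traffic uses the same healthy link.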

In the longer run you could consider taking one link out of the bond and dedicating it to corosync
 
Instead of a VLAN, I should use an independent network for corosync. Corosync communicates over the ring0_addr addresses, if I understand correctly.

I will consider your input for the future. I am not sure yet how I will solve it, but at least all three nodes are online again now, and I know I have to keep an eye on corosync.

There are no errors on the individual NICs:
Code:
root@drax:/etc/pve# ip -details -statistics link show dev eno1
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 40:a8:f0:2e:09:d8 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 40:a8:f0:2e:09:d8 queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 addrgenmode eui64 numtxqueues 5 numrxqueues 5 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    19226960915 51419295 0       0       0       12698798
    TX: bytes  packets  errors  dropped carrier collsns
    16594764496 70526656 0       0       0       0
root@drax:/etc/pve# ip -details -statistics link show dev eno2
3: eno2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 40:a8:f0:2e:09:d8 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 40:a8:f0:2e:09:d9 queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 addrgenmode eui64 numtxqueues 5 numrxqueues 5 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    99893843888 100796826 0       0       0       5590170
    TX: bytes  packets  errors  dropped carrier collsns
    20181194415 35479146 0       0       0       0
root@drax:/etc/pve# ip -details -statistics link show dev eno3
4: eno3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 40:a8:f0:2e:09:d8 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 40:a8:f0:2e:09:da queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 addrgenmode eui64 numtxqueues 5 numrxqueues 5 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    2997116400 31967401 0       0       0       937996
    TX: bytes  packets  errors  dropped carrier collsns
    4188781689 38675527 0       0       0       0
root@drax:/etc/pve# ip -details -statistics link show dev eno4
5: eno4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 40:a8:f0:2e:09:d8 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 40:a8:f0:2e:09:db queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 addrgenmode eui64 numtxqueues 5 numrxqueues 5 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    6269396832 51974235 0       0       0       942155
    TX: bytes  packets  errors  dropped carrier collsns
    4853740763 42780307 0       0       0       0
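When checking several slaves, reading those counter columns by eye gets tedious. A small hedged sketch that parses the "ip -statistics link show" output into a dict, so the error/drop fields can be checked programmatically (the sample text is taken from the eno1 output above):

```python
import re

def error_counters(ip_link_output: str) -> dict:
    """Map the RX:/TX: header fields of `ip -statistics link show`
    output onto the numeric values on the following line."""
    counters = {}
    lines = ip_link_output.splitlines()
    for i, line in enumerate(lines):
        m = re.match(r"\s*(RX|TX):\s+(.*)", line)
        if m and i + 1 < len(lines):
            names = m.group(2).split()          # bytes packets errors dropped ...
            values = lines[i + 1].split()
            for name, value in zip(names, values):
                counters[f"{m.group(1)}_{name}"] = int(value)
    return counters

sample = """\
    RX: bytes  packets  errors  dropped overrun mcast
    19226960915 51419295 0       0       0       12698798
    TX: bytes  packets  errors  dropped carrier collsns
    16594764496 70526656 0       0       0       0
"""
c = error_counters(sample)
print(c["RX_errors"], c["RX_dropped"], c["TX_errors"])  # all 0 on a healthy link
```

In practice you would feed it the output of "ip -statistics link show dev enoX" for each slave and flag any non-zero errors/dropped value.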

Thanks a lot for your inputs and best regards
Joel
 
