[SOLVED] dcdb not syncing ?

bladux

Hi,

I have an issue with my newly updated cluster (5.1): only 3 of the 10 nodes that are up are synced (pmxcfs / dcdb). The remaining 7 never catch up.
Nov 6 15:24:35 R1M1 pmxcfs[1236]: [dcdb] notice: synced members: 1/1236, 16/1222, 17/1207


It seems the updates are sent, but it is always the same number of updates, and I have not seen any node catch up:
Nov 6 15:27:26 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:27:27 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:27 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (164) updates
Nov 6 15:27:27 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:27 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:27:28 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:28 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (69) updates
Nov 6 15:27:28 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:28 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:27:29 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:29 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (69) updates
Nov 6 15:27:29 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:29 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:27:33 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:33 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (164) updates
Nov 6 15:27:33 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:33 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:27:41 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:41 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (164) updates
Nov 6 15:27:41 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:41 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:27:42 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:42 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (69) updates
Nov 6 15:27:42 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:42 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:27:42 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:42 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (69) updates
Nov 6 15:27:42 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:27:42 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
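
To check whether it really is the same few update counts repeating, the log can be tallied. This is only a rough sketch: the function name `update_count_histogram` is made up for illustration, and it assumes the log lines look exactly like the ones pasted above.

```shell
# Read pmxcfs log lines on stdin and print "<update count><TAB><occurrences>"
# for every "sent all (N) updates" message, sorted by update count.
update_count_histogram() {
    awk '
        match($0, /sent all \([0-9]+\) updates/) {
            # "sent all (" is 10 characters, ") updates" is 9 characters
            n = substr($0, RSTART + 10, RLENGTH - 19)
            count[n]++
        }
        END { for (n in count) printf "%s\t%d\n", n, count[n] }
    ' | sort -n
}
```

It could be fed with something like `journalctl -u pve-cluster | update_count_histogram`, assuming pmxcfs logs to the journal under that unit on your nodes. If only a handful of distinct counts show up again and again, the leader is resending the same updates without anyone catching up.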

Here are the logs that seem to repeat over and over:
Nov 6 15:28:42 R1M1 pmxcfs[1236]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Nov 6 15:28:42 R1M1 pmxcfs[1236]: [dcdb] notice: remove message from non-member 10/1302
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 12/1304, 16/1222, 17/1207
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: starting data syncronisation
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE0)
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: received all states
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: leader is 1/1236
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: synced members: 1/1236, 16/1222, 17/1207
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (69) updates
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: all data is up to date
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 16/1222, 17/1207
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: starting data syncronisation
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE1)
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: received all states
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: leader is 1/1236
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: synced members: 1/1236, 16/1222, 17/1207
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: all data is up to date
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 3/1269, 16/1222, 17/1207
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: starting data syncronisation
Nov 6 15:28:44 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE2)
Nov 6 15:28:45 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 3/1269, 14/17145, 16/1222, 17/1207
Nov 6 15:28:45 R1M1 pmxcfs[1236]: [dcdb] notice: queue not emtpy - resening 1 messages
Nov 6 15:28:45 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE3)
Nov 6 15:28:45 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 3/1269, 16/1222, 17/1207
Nov 6 15:28:45 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE4)
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 3/1269, 10/1302, 16/1222, 17/1207
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: queue not emtpy - resening 3 messages
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE5)
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 3/1269, 16/1222, 17/1207
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE6)
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 16/1222, 17/1207
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE7)
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: received all states
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: leader is 1/1236
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: synced members: 1/1236, 16/1222, 17/1207
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: start sending inode updates
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: sent all (0) updates
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: all data is up to date
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: dfsm_deliver_queue: queue length 4
Nov 6 15:28:48 R1M1 pmxcfs[1236]: [dcdb] notice: remove message from non-member 10/1302
Nov 6 15:28:50 R1M1 pmxcfs[1236]: [dcdb] notice: members: 1/1236, 12/1304, 16/1222, 17/1207
Nov 6 15:28:50 R1M1 pmxcfs[1236]: [dcdb] notice: starting data syncronisation
Nov 6 15:28:50 R1M1 pmxcfs[1236]: [dcdb] notice: received sync request (epoch 1/1236/00000EE8)
Nov 6 15:28:50 R1M1 pmxcfs[1236]: [dcdb] notice: received all states


I'm a bit stuck and running out of ideas...
 
I fired up all my nodes (17), and only 6 are synced. Some of my nodes have been up for over 2 hours and all containers are running, but the nodes are still not synced.

Is there any way to force a sync? I'm worried something might go wrong if I leave it this way.
 
Posting how I solved it:
I manually restarted all nodes that were not in sync and made sure the sqlite file had no corruption:

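# Stop cluster communication on this node and remove the pmxcfs lock file
service corosync stop
rm /var/lib/pve-cluster/.pmxcfs.lockfile
# Dump the existing database to SQL text (the dot-commands are typed at the sqlite3 prompt)
rm backup.db
sqlite3 /var/lib/pve-cluster/config.db
.output backup.db
.dump
.quit
# Rebuild a fresh database from the dump
sqlite3 database_fixed.db
.read backup.db
.quit
# Keep the old database aside and put the rebuilt one in place
mv /var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db_hs_auto
mv database_fixed.db /var/lib/pve-cluster/config.db
# Start pmxcfs in local mode so /etc/pve is mounted, then copy the corosync config into it
pmxcfs -l
cp /etc/corosync/corosync.conf /etc/pve/corosync.conf
# Bring the cluster services back up
service corosync start
rm /var/lib/pve-cluster/.pmxcfs.lockfile
service pve-cluster start
service pvedaemon restart
service pveproxy restart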
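
For the "no corruption" part, SQLite has a built-in check that can be run on the file (or better, on a copy of it) before and after rebuilding. A minimal sketch, assuming `sqlite3` is installed; the function name `db_is_healthy` is made up:

```shell
# Return success only if SQLite's own integrity check reports "ok" for the
# database file given as $1. Any sqlite3 error (for example "file is not a
# database") counts as unhealthy.
db_is_healthy() {
    db_check_result=$(sqlite3 "$1" 'PRAGMA integrity_check;' 2>&1) || return 1
    [ "$db_check_result" = "ok" ]
}
```

Usage could look like `db_is_healthy /var/lib/pve-cluster/config.db && echo fine || echo damaged`. A passing check only says the file is structurally sound, not that its contents match the rest of the cluster.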

On one node I had duplicate entries that I had to delete manually from backup.db before restoring the dump into database_fixed.db.
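
Finding such duplicates in the dump by eye is tedious, so here is a rough sketch of how they could be listed. It rests on an assumption that is not confirmed in this thread: that the first value of each INSERT line in the dump is the row's key, so a repeated first value means a duplicate entry. `duplicate_insert_keys` is a made-up name.

```shell
# Print "<first value><TAB><occurrences>" for every first column value that
# appears in more than one INSERT statement of the SQL dump given as $1.
# ASSUMPTION: the first value of each row is its key.
duplicate_insert_keys() {
    awk '
        /^INSERT INTO / {
            key = $0
            sub(/^INSERT INTO [^ ]+ VALUES\(/, "", key)   # drop the statement prefix
            sub(/,.*/, "", key)                           # keep only the first value
            count[key]++
        }
        END { for (k in count) if (count[k] > 1) printf "%s\t%d\n", k, count[k] }
    ' "$1" | sort -n
}
```

Running `duplicate_insert_keys backup.db` before the `.read` step would list the candidates. Which of the duplicated lines to delete still has to be decided by hand.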

Got all my nodes running smoothly again.