Master stuck in 'syncing' state

I just migrated a cluster to another VLAN by swapping NICs and changing the cluster IPs in `/etc/hosts` and `/etc/pmg/cluster.conf`. Everything seems to be OK, except that the master is stuck in the 'syncing' state. This seems to be due to the 'pmgmirror' database sync failing, but I don't understand why.
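
For context, the `/etc/hosts` entries after the move look roughly like this (the prorelay.nl domain for mxfilter0-0 is an assumption; the IPs are the new cluster addresses that appear in the logs below):

Code:
185.233.175.215 mxfilter0-0.prorelay.nl mxfilter0-0
185.233.175.216 mxfilter0-1.prorelay.nl mxfilter0-1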

Sync is ok on the slave:

Code:
Jan 15 15:24:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: finished rule database sync from host '185.233.175.215'
Jan 15 15:24:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished  (0 errors, 0.57 seconds (files 0.15, database 0.25, config 0.17))
Jan 15 15:26:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:26:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished  (0 errors, 0.66 seconds (files 0.14, database 0.26, config 0.26))
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: detected rule database changes - starting sync from '185.233.175.215'
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: finished rule database sync from host '185.233.175.215'
Jan 15 15:28:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished  (0 errors, 0.87 seconds (files 0.19, database 0.52, config 0.17))
Jan 15 15:30:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:30:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished  (0 errors, 0.59 seconds (files 0.16, database 0.26, config 0.17))


Sync isn't ok on the master:

Code:
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.39 seconds (files 0.00, database 0.39, config 0.00))
Jan 15 15:22:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:22:46 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:22:46 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.34 seconds (files 0.00, database 0.34, config 0.00))
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.34 seconds (files 0.00, database 0.34, config 0.00))
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.42 seconds (files 0.00, database 0.42, config 0.00))
Jan 15 15:28:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:28:46 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:28:46 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.35 seconds (files 0.00, database 0.35, config 0.00))


I do not understand why the rsync is failing; exit code 23 only tells me that it was a partial transfer due to some error. The nodes can SSH into each other, and I cannot reproduce the rsync manually because there are no `/tmp/quarantinefilelist.*` files left behind.

`pmgmirror` on the master seems to run for a long time and time out while waiting for something:

Code:
root@mxfilter0-0:/var/log# strace -p 5365
strace: Process 5365 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, ^Cstrace: Process 5365 detached
<detached ...>


On the slave, there is no trace of the master attempting to connect.
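
One way to watch on the slave for incoming SSH attempts from the master is a capture along these lines (a sketch; interface and filter may need adjusting):

Code:
tcpdump -ni any host 185.233.175.215 and port 22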

Any idea?
 
Does the following work?
Code:
ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1 185.233.175.216
Otherwise, try rsyncing to some other directory:
Code:
rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -av --timeout 10 '[185.233.175.216]:/var/spool/pmg'  /tmp/test/

I hope this helps!
 
How large are the directories under '/var/spool/pmg' on both nodes?
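
For example, the per-directory sizes can be compared with:

Code:
du -sh /var/spool/pmg/*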
 
Hmm - we had a similar case in the forum some time ago.
In that case there was a mismatch between the database and what was available on disk.

Check it out and try the proposed steps: https://forum.proxmox.com/threads/nodes-stay-syncing-after-update.55577/

I hope this helps!
This seems quite likely, as the two nodes could only reach each other intermittently during the IP swap. I ran `pmgmirror stop` and `pmgmirror start` after patching `Cluster.pm`:

Code:
root@mxfilter0-0:~# diff Cluster.pm /usr/share/perl5/PMG/Cluster.pm
303c303
<     $host_name, '--timeout', '10', "[${host_ip}]:$spooldir", $spooldir,
---
>     $host_name, '--timeout', '9999999', "[${host_ip}]:$spooldir", $spooldir,


I will post the result today or tomorrow. If this doesn't help, I will try the other possible solution mentioned in the thread you linked to.
 
Increasing the timeout did not help. Neither did the solution in the other thread, or so it seems.

I rsync'ed `/var/spool/pmg` so that the contents are now identical on both nodes.
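
A sketch of that manual sync, based on the rsync invocation from the log above (direction and exact flags may have differed from what I actually ran):

Code:
rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -a '[185.233.175.216]:/var/spool/pmg/' /var/spool/pmg/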

Code:
root@mxfilter0-0:/var/spool/pmg# find | wc -l
622
root@mxfilter0-1:/var/spool/pmg# find | wc -l
622


I imagine I now need to update or re-index the database somewhere. Could you point me in the right direction?

Deleting `/var/spool/pmg` would not be an issue either, by the way. I do not have to retain quarantined mail.
 
Did you fetch the `/tmp/quarantinefilelist.xxxx` file while a sync was ongoing, and did you try to run the rsync with `--files-from`?
(Could you post the output?)
 

I am currently running this script in a screen with `watch -n 1`:

Code:
root@mxfilter0-0:~# cat fetch.sh
cat /tmp/quarantinefilelist.1990 >> tmp.txt


I will post the results in a bit.
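
Once `tmp.txt` contains entries, the failing rsync can be re-run by hand against the captured list; a sketch (the deduplication step and the list path are illustrative):

Code:
sort -u tmp.txt > /tmp/capturedfilelist
rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -av --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/capturedfilelist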
 
Well, it seems both my master and slave are now stuck in 'syncing':

Code:
root@mxfilter0-0:~# pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
mxfilter0-0(1)       185.233.175.215 master S    2 days 13:53   0.28    57%    75%
mxfilter0-1(2)       185.233.175.216 node   S    2 days 19:35   0.12    61%    76%


How can I tell PMG to 'forget' previous entries in `/var/spool/pmg` to resolve this situation? I am fine with removing old quarantined mail.
 
Did you manage to get the output of rsync with the quarantinefilelist file?

Well, seems like both my master and slave are stuck in 'syncing' now:
Anything relevant in the log?

How can I tell PMG to 'forget' previous entries in `/var/spool/pmg` to resolve this situation?
Usually, setting the quarantine lifetime to a lower value and running the pmg-daily task should work, AFAIR.
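
Something along these lines, for example (the `spamquar` section/`lifetime` option in `/etc/pmg/pmg.conf` and the `pmg-daily` service name should be verified on your installation):

Code:
# reduce the quarantine lifetime (also possible via the GUI), e.g. in /etc/pmg/pmg.conf:
#   section: spamquar
#       lifetime 1
# then run the daily cleanup immediately instead of waiting for the timer:
systemctl start pmg-daily.service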
 
I solved this by promoting the previous slave to master, deleting the new slave from the cluster, re-adding the old master to the cluster as a slave (I had to remove `/etc/pmg/cluster.conf` for that), and then promoting it back to master.
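
For reference, the corresponding commands look roughly like this (illustrative; the CID placeholder and exact syntax should be checked against `pmgcm help`):

Code:
# on the old slave (mxfilter0-1): take over as master
pmgcm promote
# on the new master: drop the stale node entry (CID as shown by 'pmgcm status')
pmgcm delete <CID>
# on the old master (mxfilter0-0): remove the stale cluster config, then re-join as a node
rm /etc/pmg/cluster.conf
pmgcm join 185.233.175.216
# finally, promote the re-joined node back to master
pmgcm promote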

Code:
root@mxfilter0-0:~# pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
mxfilter0-0(3)       185.233.175.215 master A    8 days 14:40   0.12    57%    79%
mxfilter0-1(2)       185.233.175.216 node   A    8 days 20:22   0.36    60%    78%
 