Master stuck in 'syncing' state

May 20, 2017
157
15
38
Netherlands
cyberfusion.nl
I just migrated a cluster to another vLAN by swapping NICs and changing cluster IPs in `/etc/hosts` and `/etc/pmg/cluster.conf`. Everything seems to be ok, except the master is stuck in 'syncing' state. This seems to be due to 'pmgmirror' database sync failing, but I don't understand why.

Sync is ok on the slave:

Code:
Jan 15 15:24:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: finished rule database sync from host '185.233.175.215'
Jan 15 15:24:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished  (0 errors, 0.57 seconds (files 0.15, database 0.25, config 0.17))
Jan 15 15:26:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:26:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished  (0 errors, 0.66 seconds (files 0.14, database 0.26, config 0.26))
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: detected rule database changes - starting sync from '185.233.175.215'
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: finished rule database sync from host '185.233.175.215'
Jan 15 15:28:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished  (0 errors, 0.87 seconds (files 0.19, database 0.52, config 0.17))
Jan 15 15:30:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:30:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished  (0 errors, 0.59 seconds (files 0.16, database 0.26, config 0.17))


Sync isn't ok on the master:

Code:
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.39 seconds (files 0.00, database 0.39, config 0.00))
Jan 15 15:22:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:22:46 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:22:46 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.34 seconds (files 0.00, database 0.34, config 0.00))
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.34 seconds (files 0.00, database 0.34, config 0.00))
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.42 seconds (files 0.00, database 0.42, config 0.00))
Jan 15 15:28:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:28:46 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:28:46 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished  (1 errors, 0.35 seconds (files 0.00, database 0.35, config 0.00))


I do not understand why the rsync is failing. The nodes can SSH into each other, and I cannot manually replicate the rsync as there are no `/tmp/quarantinefilelist.*` files.

`pmgmirror` on the master seems to run for a long time and timeout after waiting for something:

Code:
root@mxfilter0-0:/var/log# strace -p 5365
strace: Process 5365 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, ^Cstrace: Process 5365 detached
<detached ...>


There is no trace of the master attempting to connect to the slave on the slave.

Any idea?
 
Last edited:

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
5,997
833
148
does:
Code:
ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1 185.233.175.216
work?
else try rsyncing to some other directory:
Code:
rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -av --timeout 10 '[185.233.175.216]:/var/spool/pmg'  /tmp/test/

I hope this helps!
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
5,997
833
148
How large are the directories under '/var/spool/pmg' on both nodes?
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
5,997
833
148
May 20, 2017
157
15
38
Netherlands
cyberfusion.nl
Hmm - we had a similar case in the forum some time ago.
In that case there was a mismatch between the database and what was available on disk.

Check it out and try the proposed steps: https://forum.proxmox.com/threads/nodes-stay-syncing-after-update.55577/

I hope this helps!
This seems quite likely, as both nodes were only able to reach each other on-and-off during the IP swap. I executed `pmgmirror stop` and `pmgmirror start` after patching `Cluster.pm`:

Code:
root@mxfilter0-0:~# diff Cluster.pm /usr/share/perl5/PMG/Cluster.pm
303c303
<     $host_name, '--timeout', '10', "[${host_ip}]:$spooldir", $spooldir,
---
>     $host_name, '--timeout', '9999999', "[${host_ip}]:$spooldir", $spooldir,


I will post the result today or tomorrow. If this doesn't help, I will try the other possible solution mentioned in the thread you linked to.
 
  • Like
Reactions: Stoiko Ivanov
May 20, 2017
157
15
38
Netherlands
cyberfusion.nl
Increasing the timeout did not help. Neither did the solution in the other thread, or so it seems.

I rsync'ed `/var/spool/pmg` so the situation is the same on both nodes.

Code:
root@mxfilter0-0:/var/spool/pmg# find | wc -l
622
root@mxfilter0-1:/var/spool/pmg# find | wc -l
622


I can imagine I now need to update/re-index the database somewhere. Could you point me in the right direction?

Deleting `/var/spool/pmg` would not be an issue either, by the way. I do not have to retain quarantined mail.
 
Last edited:

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
5,997
833
148
did you fetch the /tmp/quarantinefilelist.xxxx file while a sync was ongoing - and did you try to run the rsync with --files-from ?
(could you post the output?)
 
May 20, 2017
157
15
38
Netherlands
cyberfusion.nl
did you fetch the /tmp/quarantinefilelist.xxxx file while a sync was ongoing - and did you try to run the rsync with --files-from ?
(could you post the output?)

I am currently running this script in a screen with `watch -n 1`:

Code:
root@mxfilter0-0:~# cat fetch.sh
cat /tmp/quarantinefilelist.1990 >> tmp.txt


I will post the results in a bit.
 
May 20, 2017
157
15
38
Netherlands
cyberfusion.nl
Well, seems like both my master and slave are stuck in 'syncing' now:

Code:
root@mxfilter0-0:~# pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
mxfilter0-0(1)       185.233.175.215 master S    2 days 13:53   0.28    57%    75%
mxfilter0-1(2)       185.233.175.216 node   S    2 days 19:35   0.12    61%    76%


How can I tell PMG to 'forget' previous entries in `/var/spool/pmg` to resolve this situation? I am fine with removing old quarantined mail.
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
5,997
833
148
Did you manage to get the output of rsync with the quarantinefilelist file?

Well, seems like both my master and slave are stuck in 'syncing' now:
anything relevant in the log?

How can I tell PMG to 'forget' previous entries in `/var/spool/pmg` to resolve this situation?
usually setting the quarantine lifetime to a lower time and running the pmg-daily task should work AFAIR
 
May 20, 2017
157
15
38
Netherlands
cyberfusion.nl
Did you manage to get the output of rsync with the quarantinefilelist file?


anything relevant in the log?


usually setting the quarantine lifetime to a lower time and running the pmg-daily task should work AFAIR

I solved this by promoting previous slave to master, deleting new slave from cluster, then re-adding old master to cluster as slave (had to remove `/etc/pmg/cluster.conf`). I then promoted the old slave again.

Code:
root@mxfilter0-0:~# pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
mxfilter0-0(3)       185.233.175.215 master A    8 days 14:40   0.12    57%    79%
mxfilter0-1(2)       185.233.175.216 node   A    8 days 20:22   0.36    60%    78%
 
  • Like
Reactions: Stoiko Ivanov

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get your own in 60 seconds.

Buy now!