I just migrated a cluster to another VLAN by swapping NICs and changing the cluster IPs in `/etc/hosts` and `/etc/pmg/cluster.conf`. Everything seems fine, except that the master is stuck in the 'syncing' state. This appears to be caused by the `pmgmirror` database sync failing, but I don't understand why.
Sync is ok on the slave:
Code:
Jan 15 15:24:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: finished rule database sync from host '185.233.175.215'
Jan 15 15:24:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished (0 errors, 0.57 seconds (files 0.15, database 0.25, config 0.17))
Jan 15 15:26:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:26:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished (0 errors, 0.66 seconds (files 0.14, database 0.26, config 0.26))
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: detected rule database changes - starting sync from '185.233.175.215'
Jan 15 15:28:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: finished rule database sync from host '185.233.175.215'
Jan 15 15:28:29 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished (0 errors, 0.87 seconds (files 0.19, database 0.52, config 0.17))
Jan 15 15:30:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: starting cluster syncronization
Jan 15 15:30:28 mxfilter0-1.prorelay.nl pmgmirror[1097]: cluster syncronization finished (0 errors, 0.59 seconds (files 0.16, database 0.26, config 0.17))
Sync isn't ok on the master:
Code:
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:20:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished (1 errors, 0.39 seconds (files 0.00, database 0.39, config 0.00))
Jan 15 15:22:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:22:46 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:22:46 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished (1 errors, 0.34 seconds (files 0.00, database 0.34, config 0.00))
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:24:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished (1 errors, 0.34 seconds (files 0.00, database 0.34, config 0.00))
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:26:45 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished (1 errors, 0.42 seconds (files 0.00, database 0.42, config 0.00))
Jan 15 15:28:45 mxfilter0-0 pmgmirror[5365]: starting cluster syncronization
Jan 15 15:28:46 mxfilter0-0 pmgmirror[5365]: database sync 'mxfilter0-1' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=mxfilter0-1' -q --timeout 10 '[185.233.175.216]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.5365' failed: exit code 23
Jan 15 15:28:46 mxfilter0-0 pmgmirror[5365]: cluster syncronization finished (1 errors, 0.35 seconds (files 0.00, database 0.35, config 0.00))
I do not understand why the rsync is failing. The nodes can SSH into each other, and I cannot replicate the rsync manually because there are no `/tmp/quarantinefilelist.*` files left on disk.
`pmgmirror` on the master seems to run for a long time, then time out waiting for something:
Code:
root@mxfilter0-0:/var/log# strace -p 5365
strace: Process 5365 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcc4b90790) = 0
nanosleep({tv_sec=1, tv_nsec=0}, ^Cstrace: Process 5365 detached
<detached ...>
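One caveat about the trace above: `strace -p` without `-f` attaches only to the parent process, and the nanosleep loop is just pmgmirror's scheduler sleeping between runs; the actual rsync/ssh work happens in short-lived forked children. A small local illustration, tracing a throwaway shell instead of the live daemon:

```shell
#!/bin/sh
# Illustration only: -f makes strace follow forked children, which is
# where a daemon's rsync/ssh activity would show up. We trace a
# disposable shell here rather than attaching to pid 5365.
strace -f -e trace=execve -o /tmp/pmg-trace.demo sh -c /bin/true
grep -c execve /tmp/pmg-trace.demo   # execve lines for both sh and true
```

On the master, the equivalent would be something like `strace -f -tt -e trace=execve,connect -p 5365` during a sync window, which should show whether rsync is ever exec'd and what it tries to connect to.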
On the slave, there is no trace of the master ever attempting to connect.
Any idea?