[SOLVED] Nodes stay Syncing after update

Soporte Servi · Jul 1, 2019

Today we have updated a cluster of three nodes. After the update, the master and one of the two nodes have started to attack the DB of the third node as if it were the master, while in the cluster state they were in 'syncing' mode instead of 'active' mode that only the last node in that state has been left.

In the logs of the first two servers we have obtained the following:

Jul 1 16:51:46 pmgmail1 pmgmirror [32177]: database sync 'pmgmail2' failed - command 'rsync' --rsh = ssh -l root -o BatchMode = yes -o HostKeyAlias = pmgmail2 '-q --timeout 10' [xxxx]: / var / spool / pmg '/ var / spool / pmg --files-from /tmp/quarantinefilelist.32177' failed: exit code 23

The file /tmp/quarantinefilelist.32177 does not exist on any of the three servers.

pmgversion:

proxmox-mailgateway: 5.2-1 (API: 5.2-3 / 26df5d99, running kernel: 4.15.18-16-pve)
pmg-api: 5.2-3
pmg-gui: 1.0-45
pve-kernel-4.15: 5.4-4
pve-kernel-4.13: 5.1-45
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-10-pve: 4.15.18-31
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.13.16-3-pve: 4.13.16-49
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
libarchive-perl: 3.2.1-1
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-http-server-perl: 2.0-12
libxdgmime-perl: 0.01-3
lvm2: 2.02.168-2
pmg-docs: 5.2-3
proxmox-spamassassin: 3.4.2-2
proxmox-widget-toolkit: 1.0-28
pve-firmware: 2.0-5
pve-xtermjs: 3.10.1-2
zfsutils-linux: 0.7.13-pve1 ~ bpo1

KOSTAS TSIVERIOTIS · Aug 7, 2019

Did you have any workaround for this?

Soporte Servi · Aug 27, 2019

Not yet.

Stoiko Ivanov · Aug 27, 2019

* Please try to restart the `pmgmirror` and `pmgtunnel` services.
* what's the (redacted!) output of `pmgcm status`

Thanks!

Soporte Servi · Aug 28, 2019

Here's the output of the 'pmgcm status':

root@host1:/etc/pmg# pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
host1(1) X.X.X.1 master S 57 days 20:33 0.23 55% 20%
host3(3) X.X.X.3 node S 57 days 20:09 0.12 43% 39%
host2(2) X.X.X.2 node A 57 days 16:25 0.06 56% 36%

Stoiko Ivanov · Aug 28, 2019

how many files (and how large are they) are in /var/spool/pmg on host2 ?

Soporte Servi · Aug 28, 2019

Here's the content of /var/spool/pmg on host2:

root@host2:/var/spool/pmg# du -sh *
32K active
7.2G cluster
4.0K spam
4.0K virus

Stoiko Ivanov · Aug 28, 2019

Soporte Servi said:
7.2G cluster

* That might explain it - 7.2G quarantined mail?!
* the timeout (10) is currently hardcoded in the sourcecode (/usr/share/perl5/PMG/Cluster.pm)

However 7.2G of quarantined mail is really odd - please check how that came to happen...

Soporte Servi · Aug 28, 2019

It is possible to change the timeout?

Stoiko Ivanov · Aug 28, 2019

Hm - you can set the timeout in the source-code : /usr/share/perl5/PMG/Cluster.pm line 303
and restart the cluster-services (`pmgmirror`)
(for the amount of data set it to 120, if you have gigabit between the nodes)
If this resolves the issue we can consider making it configurable.

However I'm still curious what files take up so much space - could you please check which files are the largest, and where they are?

Thanks!

Soporte Servi · Aug 28, 2019

Still getting the same error:

Aug 28 10:45:17 host1 pmgmirror[26930]: database sync 'host2' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=host2' -q --timeout 120 '[X.X.X.2]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.26930' failed: exit code 23

About your question, here's what we have:

root@host2:/var/spool/pmg/cluster# du -sh *
1.5G 1
2.5G 2
3.3G 3

root@host2:/var/spool/pmg/cluster/3# du -sh *
2.4G spam
903M virus

root@host2:/var/spool/pmg/cluster/3/spam# ls
00 05 0A 0F 14 19 1E 23 28 2D 32 37 3C 41 46 4B 50 55 5A 5F 64 69 6E 73 78 7D 82 87 8C 91 96 9B A0 A5 AA AF B4 B9 BE C3 C8 CD D2 D7 DC E1 E6 EB F0 F5 FA FF
01 06 0B 10 15 1A 1F 24 29 2E 33 38 3D 42 47 4C 51 56 5B 60 65 6A 6F 74 79 7E 83 88 8D 92 97 9C A1 A6 AB B0 B5 BA BF C4 C9 CE D3 D8 DD E2 E7 EC F1 F6 FB
02 07 0C 11 16 1B 20 25 2A 2F 34 39 3E 43 48 4D 52 57 5C 61 66 6B 70 75 7A 7F 84 89 8E 93 98 9D A2 A7 AC B1 B6 BB C0 C5 CA CF D4 D9 DE E3 E8 ED F2 F7 FC
03 08 0D 12 17 1C 21 26 2B 30 35 3A 3F 44 49 4E 53 58 5D 62 67 6C 71 76 7B 80 85 8A 8F 94 99 9E A3 A8 AD B2 B7 BC C1 C6 CB D0 D5 DA DF E4 E9 EE F3 F8 FD
04 09 0E 13 18 1D 22 27 2C 31 36 3B 40 45 4A 4F 54 59 5E 63 68 6D 72 77 7C 81 86 8B 90 95 9A 9F A4 A9 AE B3 B8 BD C2 C7 CC D1 D6 DB E0 E5 EA EF F4 F9 FE

Stoiko Ivanov · Aug 28, 2019

* What bandwidth do you have between the nodes?
* Please increase it quite some more - the error 23 is due to the partial transmit (i expect because the timeout runs out before)

Soporte Servi · Aug 28, 2019

The bandwith between the nodes is more than a GB.

Changed the timeout to 300 and still the same error.

Stoiko Ivanov · Aug 28, 2019

hmm - you could test how long a plain rsync would take:

Code:

mkdir -p /tmp/pmgsynctest                                                       
time rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=host2' -q --timeout 120 '[X.X.X.2]:/var/spool/pmg' /tmp/pmgsynctest

Thanks!

Soporte Servi · Aug 28, 2019

Output:

root@host1:/var/spool/pmg# time rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=host2' -q --timeout 300 '[X.X.X.2]:/var/spool/pmg' /tmp/pmgsynctest
real 0m0.159s
user 0m0.013s
sys 0m0.002s

Stoiko Ivanov · Aug 28, 2019

sorry - I mistakenly pasted the '-q' in the command and forgot that we don't have an explicit file-list - can you try with '-av' instead (adds recursive syncing and verbose output)

Soporte Servi · Aug 28, 2019

sent 1,462,908 bytes received 7,522,633,219 bytes 15,184,855.96 bytes/sec
total size is 7,515,180,685 speedup is 1.00

real 8m15.260s
user 1m2.336s
sys 0m36.804s

Stoiko Ivanov · Aug 28, 2019

Soporte Servi said:
8m15.260s

Thats more than the 300 you set (although rsync should only transfer the delta) - maybe try with a timeout of 1000
(the next run would need to be faster...)
also - once the mirror start you should be able to see a file in the master's 'tmp/' starting with quarantinefilelist. - if possible please copy it and run `wc -l` on it

Thanks!

Soporte Servi · Aug 28, 2019

Even with 2000s timeout I still get the same error code.

About the quarantinefile:

root@host2:/tmp# cat quarantinefilelist.20882 | wc -l
1000

Stoiko Ivanov · Aug 28, 2019

ok - did you get any error while you ran the rsync command manually?

[SOLVED] Nodes stay Syncing after update

New Member

New Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member