[SOLVED] Nodes stay Syncing after update

Soporte Servi

New Member
Today we updated a cluster of three nodes. After the update, the master and one of the two other nodes started pulling from the database of the third node as if that one were the master; in the cluster status those two are stuck in 'syncing' instead of 'active', and only the third node is left in the 'active' state.

The logs of the first two servers show the following:

Jul 1 16:51:46 pmgmail1 pmgmirror[32177]: database sync 'pmgmail2' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=pmgmail2' -q --timeout 10 '[xxxx]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.32177' failed: exit code 23

The file /tmp/quarantinefilelist.32177 does not exist on any of the three servers.
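
For reference, the entries above come from the pmgmirror service log; on a systemd-based install something like this should show them on each node:
Code:
journalctl -u pmgmirror --since today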


pmgversion:

proxmox-mailgateway: 5.2-1 (API: 5.2-3/26df5d99, running kernel: 4.15.18-16-pve)
pmg-api: 5.2-3
pmg-gui: 1.0-45
pve-kernel-4.15: 5.4-4
pve-kernel-4.13: 5.1-45
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-10-pve: 4.15.18-31
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.13.16-3-pve: 4.13.16-49
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
libarchive-perl: 3.2.1-1
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-http-server-perl: 2.0-12
libxdgmime-perl: 0.01-3
lvm2: 2.02.168-2
pmg-docs: 5.2-3
proxmox-spamassassin: 3.4.2-2
proxmox-widget-toolkit: 1.0-28
pve-firmware: 2.0-5
pve-xtermjs: 3.10.1-2
zfsutils-linux: 0.7.13-pve1~bpo1
 

Stoiko Ivanov

Proxmox Staff Member
* Please try to restart the `pmgmirror` and `pmgtunnel` services (see below).
* What's the (redacted!) output of `pmgcm status`?
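
On a standard PMG install both are plain systemd services, so the restart boils down to something like:
Code:
systemctl restart pmgmirror pmgtunnel
pmgcm status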

Thanks!
 

Soporte Servi

New Member
Here's the output of `pmgcm status`:

root@host1:/etc/pmg# pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
host1(1)               X.X.X.1      master S      57 days 20:33   0.23    55%    20%
host3(3)               X.X.X.3      node   S      57 days 20:09   0.12    43%    39%
host2(2)               X.X.X.2      node   A      57 days 16:25   0.06    56%    36%
 

Stoiko Ivanov

Proxmox Staff Member
How many files (and how large are they) are in /var/spool/pmg on host2?
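
Something like this gives both the count and the per-directory sizes (plain findutils/coreutils, nothing PMG-specific):
Code:
find /var/spool/pmg -type f | wc -l
du -sh /var/spool/pmg/*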
 

Soporte Servi

New Member
Here's the content of /var/spool/pmg on host2:

root@host2:/var/spool/pmg# du -sh *
32K active
7.2G cluster
4.0K spam
4.0K virus
 

Stoiko Ivanov

Proxmox Staff Member
7.2G cluster
* That might explain it - 7.2G quarantined mail?!
* the timeout (10) is currently hardcoded in the source code (/usr/share/perl5/PMG/Cluster.pm)

However 7.2G of quarantined mail is really odd - please check how that came to happen...
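
To see where the space goes, you could drill down per cluster-node directory, e.g.:
Code:
du -sh /var/spool/pmg/cluster/*
du -sh /var/spool/pmg/cluster/*/*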
 

Stoiko Ivanov

Proxmox Staff Member
Hm - you can set the timeout in the source code: /usr/share/perl5/PMG/Cluster.pm, line 303,
and restart the cluster services (`pmgmirror`).
(for this amount of data, set it to 120 if you have gigabit between the nodes)
If this resolves the issue we can consider making it configurable.
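
A rough sketch of that procedure (the exact line may shift between package versions, so inspect the context first rather than editing blindly):
Code:
# show the area around the hardcoded timeout, then edit the value by hand
sed -n '295,310p' /usr/share/perl5/PMG/Cluster.pm
systemctl restart pmgmirror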

However I'm still curious what files take up so much space - could you please check which files are the largest, and where they are?

Thanks!
 

Soporte Servi

New Member
Still getting the same error:

Aug 28 10:45:17 host1 pmgmirror[26930]: database sync 'host2' failed - command 'rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=host2' -q --timeout 120 '[X.X.X.2]:/var/spool/pmg' /var/spool/pmg --files-from /tmp/quarantinefilelist.26930' failed: exit code 23


About your question, here's what we have:

root@host2:/var/spool/pmg/cluster# du -sh *
1.5G 1
2.5G 2
3.3G 3

root@host2:/var/spool/pmg/cluster/3# du -sh *
2.4G spam
903M virus

root@host2:/var/spool/pmg/cluster/3/spam# ls
00 05 0A 0F 14 19 1E 23 28 2D 32 37 3C 41 46 4B 50 55 5A 5F 64 69 6E 73 78 7D 82 87 8C 91 96 9B A0 A5 AA AF B4 B9 BE C3 C8 CD D2 D7 DC E1 E6 EB F0 F5 FA FF
01 06 0B 10 15 1A 1F 24 29 2E 33 38 3D 42 47 4C 51 56 5B 60 65 6A 6F 74 79 7E 83 88 8D 92 97 9C A1 A6 AB B0 B5 BA BF C4 C9 CE D3 D8 DD E2 E7 EC F1 F6 FB
02 07 0C 11 16 1B 20 25 2A 2F 34 39 3E 43 48 4D 52 57 5C 61 66 6B 70 75 7A 7F 84 89 8E 93 98 9D A2 A7 AC B1 B6 BB C0 C5 CA CF D4 D9 DE E3 E8 ED F2 F7 FC
03 08 0D 12 17 1C 21 26 2B 30 35 3A 3F 44 49 4E 53 58 5D 62 67 6C 71 76 7B 80 85 8A 8F 94 99 9E A3 A8 AD B2 B7 BC C1 C6 CB D0 D5 DA DF E4 E9 EE F3 F8 FD
04 09 0E 13 18 1D 22 27 2C 31 36 3B 40 45 4A 4F 54 59 5E 63 68 6D 72 77 7C 81 86 8B 90 95 9A 9F A4 A9 AE B3 B8 BD C2 C7 CC D1 D6 DB E0 E5 EA EF F4 F9 FE
 

Stoiko Ivanov

Proxmox Staff Member
* What bandwidth do you have between the nodes? (see the sketch below for a quick way to measure)
* Please increase the timeout quite a bit more - exit code 23 means a partial transfer (I expect because the timeout runs out before the transfer finishes)
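
If you're unsure about the effective throughput, iperf3 gives a quick measurement (assuming it is installed on both nodes - it is not part of the default PMG install):
Code:
# on host2:
iperf3 -s
# on host1:
iperf3 -c X.X.X.2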
 

Soporte Servi

New Member
The bandwidth between the nodes is more than a gigabit.

Changed the timeout to 300 and still the same error.
 

Stoiko Ivanov

Proxmox Staff Member
hmm - you could test how long a plain rsync would take:
Code:
mkdir -p /tmp/pmgsynctest
time rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=host2' -q --timeout 120 '[X.X.X.2]:/var/spool/pmg' /tmp/pmgsynctest
Thanks!
 

Soporte Servi

New Member
Output:

root@host1:/var/spool/pmg# time rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=host2' -q --timeout 300 '[X.X.X.2]:/var/spool/pmg' /tmp/pmgsynctest
real 0m0.159s
user 0m0.013s
sys 0m0.002s
 

Stoiko Ivanov

Proxmox Staff Member
Sorry - I mistakenly pasted the '-q' in the command and forgot that we don't have an explicit file list here - can you try with '-av' instead? (adds recursive syncing and verbose output)
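
That is, the command from above adapted accordingly:
Code:
time rsync '--rsh=ssh -l root -o BatchMode=yes -o HostKeyAlias=host2' -av --timeout 300 '[X.X.X.2]:/var/spool/pmg' /tmp/pmgsynctest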
 

Soporte Servi

New Member
sent 1,462,908 bytes received 7,522,633,219 bytes 15,184,855.96 bytes/sec
total size is 7,515,180,685 speedup is 1.00

real 8m15.260s
user 1m2.336s
sys 0m36.804s
 

Stoiko Ivanov

Proxmox Staff Member
8m15.260s
That's more than the 300 you set (although rsync should only transfer the delta) - maybe try with a timeout of 1000
(the next run should be faster, since only the delta is left to transfer...)
Also - once the mirror starts you should be able to see a file in the master's /tmp/ starting with quarantinefilelist. - if possible please copy it and run `wc -l` on it
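
Something along these lines (the suffix is the pmgmirror worker's PID, so it varies from run to run):
Code:
cp /tmp/quarantinefilelist.* /root/   # grab a copy before pmgmirror removes it
wc -l /root/quarantinefilelist.*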

Thanks!
 

Soporte Servi

New Member
Even with a 2000s timeout I still get the same error code.

About the quarantine file list:

root@host2:/tmp# cat quarantinefilelist.20882 | wc -l
1000
 

Stoiko Ivanov

Proxmox Staff Member
OK - did you get any errors when you ran the rsync command manually?
 
