[SOLVED] Cluster unable to sync both ways since 11nov

c0urier

I've seen some other threads and am not sure whether they're related, but I'm trying to figure out how to get my cluster to sync both ways again. The last successful sync was on 2022-11-11 at 07:28.
The error I see is:

Code:
Nov 11 07:28:00 mail-gw02 pmgmirror[1017497]: cluster synchronization finished  (0 errors, 3.99 seconds (files 0.39, database 2.86, config 0.74))
Nov 11 07:29:56 mail-gw02 pmgmirror[1017497]: starting cluster synchronization
Nov 11 07:29:57 mail-gw02 pmgmirror[1017497]: database sync 'mail-gw01' failed - Wide character in subroutine entry at /usr/share/perl5/PMG/DBTools.pm line 1093.
Nov 11 07:29:59 mail-gw02 pmgmirror[1017497]: cluster synchronization finished  (1 errors, 3.27 seconds (files 0.00, database 2.54, config 0.73))
Nov 11 07:31:56 mail-gw02 pmgmirror[1017497]: starting cluster synchronization
Nov 11 07:31:58 mail-gw02 pmgmirror[1017497]: detected rule database changes - starting sync from '10.2.0.22'
Nov 11 07:31:58 mail-gw02 pmgmirror[1017497]: finished rule database sync from host '10.2.0.22'
Nov 11 07:31:58 mail-gw02 pmgmirror[1017497]: database sync 'mail-gw01' failed - Wide character in subroutine entry at /usr/share/perl5/PMG/DBTools.pm line 1093.
Nov 11 07:32:00 mail-gw02 pmgmirror[1017497]: cluster synchronization finished  (1 errors, 3.48 seconds (files 0.00, database 2.73, config 0.75))
And that has continued to this day.

From mail-gw02 to mail-gw01 everything is fine.

Code:
root@mail-gw02:~> pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
mail-gw01(2)         10.2.0.22       master A    9 days 21:25   0.18    38%    21%
mail-gw02(1)         10.4.0.2        node   S    9 days 21:25   0.16    38%    22%

Code:
root@mail-gw02:~> pmgversion  -v
proxmox-mailgateway: 7.1-2 (API: 7.1-9/e0c0be55, running kernel: 5.15.64-1-pve)
pmg-api: 7.1-9
pmg-gui: 3.1-6
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
clamav-daemon: 0.103.7+dfsg-0+deb11u1
ifupdown: 0.8.36+pve2
libarchive-perl: 3.4.0-1
libjs-extjs: 7.0.0-1
libjs-framework7: 4.4.7-1
libproxmox-acme-perl: 1.4.2
libproxmox-acme-plugins: 1.4.2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-6
libpve-http-server-perl: 4.1-5
libxdgmime-perl: 1.0-1
lvm2: 2.03.11-2.1
pmg-docs: 7.1-2
pmg-i18n: 2.7-2
pmg-log-tracker: 2.3.1-1
postgresql-13: 13.8-0+deb11u1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-spamassassin: 3.4.6-4
proxmox-widget-toolkit: 3.5.1
pve-firmware: 3.5-6
pve-xtermjs: 4.16.0-1
zfsutils-linux: 2.1.6-pve1

Any help to get the cluster back in sync would be greatly appreciated.
 
I've seen some other threads and not sure if it's related - But I'm trying to figure out how to get my cluster to sync both ways again since it has not worked since 2022.11.11 at 07.28.
Could you share the logs from that timeframe (from 10 minutes before until 10 minutes after)?
That might help narrow down where the issue is.
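
One way to cut such a window out of a syslog-style file is sketched below. The helper name `log_window` and the log path in the usage note are assumptions for illustration, not something PMG ships.

```shell
# Print syslog-style lines ("Mon DD HH:MM:SS host ...") read from stdin that
# fall on the given day between two times (inclusive).
# Hypothetical helper, not part of PMG.
log_window() {
    # $1: day as it appears in the log, e.g. "Nov 11"
    # $2: start time HH:MM:SS, $3: end time HH:MM:SS
    awk -v day="$1" -v from="$2" -v to="$3" '
        ($1 " " $2) == day && $3 >= from && $3 <= to
    '
}
```

For the window around the first failure this could be called as `log_window "Nov 11" 07:18:00 07:38:00 < /var/log/syslog`. If the logs live in the systemd journal instead, `journalctl --since "2022-11-11 07:18" --until "2022-11-11 07:38"` should give a similar cut.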

The error message is odd, as it points to a location in the code where this error should not happen. Could you share line 1093 (and the surrounding lines) of /usr/share/perl5/PMG/DBTools.pm?

Did you change anything in the database manually (i.e. not using the PMG tooling/GUI)?
 
Hi Stoiko

Thanks for reaching out. The logs from mail-gw01 and mail-gw02 are attached, as there are a lot of them.

DBTools.pm - mail-gw01
Code:
sub update_master_clusterinfo {
    my ($clientcid) = @_;

    my $dbh = open_ruledb();

    $dbh->do("DELETE FROM ClusterInfo WHERE CID = $clientcid");

    my @mt = ('CMSReceivers', 'CGreylist', 'UserPrefs', 'DomainStat', 'DailyStat', 'LocalStat', 'VirusInfo');

    foreach my $table (@mt) {
        $dbh->do ("INSERT INTO ClusterInfo (cid, name, ivalue) select $clientcid, 'lastmt_$table', " .
                  "EXTRACT(EPOCH FROM now())");
    }
}

DBTools.pm - mail-gw02
Code:
sub update_master_clusterinfo {
    my ($clientcid) = @_;

    my $dbh = open_ruledb();

    $dbh->do("DELETE FROM ClusterInfo WHERE CID = $clientcid");

    my @mt = ('CMSReceivers', 'CGreylist', 'UserPrefs', 'DomainStat', 'DailyStat', 'LocalStat', 'VirusInfo');

    foreach my $table (@mt) {
        $dbh->do ("INSERT INTO ClusterInfo (cid, name, ivalue) select $clientcid, 'lastmt_$table', " .
                  "EXTRACT(EPOCH FROM now())");
    }
}

Did you change anything in the database manually? (i.e. not using the PMG tooling/GUI)?
No, I have made no manual changes to the DB.

Again thanks for assisting.
 
Hi Stoiko

Thanks, I appreciate you taking the time.

They're attached. The only things redacted are the blacklist and whitelist; everything else should be generic.

The modify field and notify entries are what came with PMG, and they have not been modified since installation quite some years ago.
 

Thanks!

Do you by any chance know when you installed the upgrades?
(Or could you share /var/log/apt/history.log, or the rotated variant that captures the timeframe before the issue occurred?)
 
They're pretty much identical.

mail-gw01:
Code:
Start-Date: 2022-11-09  13:19:07
Commandline: apt dist-upgrade -y
Upgrade: pmg-api:amd64 (7.1-7, 7.1-8), pmg-gui:amd64 (3.1-4, 3.1-5)
End-Date: 2022-11-09  13:19:22

Start-Date: 2022-11-12  18:56:04
Commandline: apt dist-upgrade -y
Upgrade: libpixman-1-0:amd64 (0.40.0-1, 0.40.0-1.1~deb11u1)
End-Date: 2022-11-12  18:56:04

Start-Date: 2022-11-15  23:07:28
Commandline: apt dist-upgrade -y
Upgrade: grub-pc-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), pmg-api:amd64 (7.1-8, 7.1-9), grub-efi-amd64-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub2-common:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), libpve-http-server-perl:amd64 (4.1-4, 4.1-5), libpve-common-perl:amd64 (7.2-3, 7.2-5), grub-common:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub-efi-ia32-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub-pc:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4)
End-Date: 2022-11-15  23:07:46

mail-gw02:
Code:
Start-Date: 2022-11-09  13:19:15
Commandline: apt dist-upgrade -y
Upgrade: pmg-api:amd64 (7.1-7, 7.1-8), pmg-gui:amd64 (3.1-4, 3.1-5)
End-Date: 2022-11-09  13:20:02

Start-Date: 2022-11-12  18:55:56
Commandline: apt dist-upgrade -y
Upgrade: libpixman-1-0:amd64 (0.40.0-1, 0.40.0-1.1~deb11u1)
End-Date: 2022-11-12  18:55:57

Start-Date: 2022-11-15  23:07:55
Commandline: apt dist-upgrade -y
Upgrade: grub-pc-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), pmg-api:amd64 (7.1-8, 7.1-9), grub-efi-amd64-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub2-common:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), libpve-http-server-perl:amd64 (4.1-4, 4.1-5), libpve-common-perl:amd64 (7.2-3, 7.2-5), grub-common:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub-efi-ia32-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub-pc:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4)
End-Date: 2022-11-15  23:09:05

And the issue first showed up on the 11th, so between the 7.1-7 -> 7.1-8 upgrade (2022-11-09) and the 7.1-8 -> 7.1-9 upgrade (2022-11-15).
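
To pull just the pmg-api transitions out of an apt history file, something like the following sketch could be used. The function name `pmg_api_upgrades` is made up for illustration, and it assumes the `Start-Date:` / `Upgrade:` layout shown in the excerpts above.

```shell
# List the date and version change of every pmg-api upgrade found in an
# apt history log (read from the given files, or from stdin if none given).
# Hypothetical helper, not part of PMG or apt.
pmg_api_upgrades() {
    awk '
        # Remember the date of the transaction currently being read.
        /^Start-Date:/ { date = $2 }
        # On an Upgrade line, print only the pmg-api entry, if present.
        /^Upgrade:/ {
            if (match($0, /pmg-api:[a-z0-9]+ \([^)]*\)/)) {
                print date, substr($0, RSTART, RLENGTH)
            }
        }
    ' "$@"
}
```

Run as `pmg_api_upgrades /var/log/apt/history.log`, this prints one line per pmg-api upgrade, which makes it easy to see which version was installed on the day the sync started failing.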
 
OK, that helps. So the issue was introduced with 7.1-8!

Thanks!
 
Sadly, I did not manage to reproduce the issue here, despite trying to match the potentially problematic mail as closely as possible.

The following commands should produce a text listing of the DB tables that might be causing the issue:
Code:
psql Proxmox_ruledb --echo-queries -c "select * from clusterinfo" >> dbdump.txt
psql Proxmox_ruledb --echo-queries -c "select * from cmsreceivers" >> dbdump.txt
# 1668121200 is 2022-11-10 23:00:00 UTC (midnight CET on 2022-11-11)
for i in cmailstore cstatistic domainstat dailystat; do
    psql Proxmox_ruledb --echo-queries -c "select * from $i where time > 1668121200" >> dbdump.txt
done

If you like, please run it, gzip the resulting dbdump.txt and share it (if you prefer, send it via mail to s.ivanov _at_ proxmox.com).

Thanks
 
Thanks. I think the issue is with quarantined mails.

How long is your quarantine lifetime? (GUI -> Configuration -> Spam Detector -> Quarantine)

I still wonder how the mails managed to end up in quarantine, since here the mail is just dropped (just like in the linked thread) and so does not break the cluster sync.

Do you have an LDAP profile configured?

In any case, depending on how long you want to wait/can wait:
* after the quarantine lifetime (in days) + 1, counted from the day you installed pmg-api 7.1-9 (2022-11-15), the issue should resolve itself
(all problematic mails will be purged by the pmgspamreport timer)
* otherwise we could try to selectively remove old mails from the quarantine DB and from the spooldir, but this is a bit involved and could cause inconsistencies
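
The date for the first option can be worked out with a small sketch like this one. It assumes GNU date (as found on Debian-based systems), and the function name `quarantine_clear_date` is made up for illustration.

```shell
# Print the date on which the last mail quarantined before the fix should
# have been purged: install date + quarantine lifetime + 1 day.
# Hypothetical helper; relies on GNU date's relative date parsing.
quarantine_clear_date() {
    # $1: date the fixed package was installed (YYYY-MM-DD)
    # $2: quarantine lifetime in days
    date -u -d "$1 +$(( $2 + 1 )) days" +%F
}
```

For example, `quarantine_clear_date 2022-11-15 31` prints `2022-12-17` for a 31-day lifetime.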
 
The spam quarantine lifetime is 31 days and has not been changed for as long as I can remember. And I agree, it seems a bit odd.

No LDAP profiles configured.

So if I changed the quarantine lifetime to, e.g., 7 days, it should resolve itself?

Either way, I am willing to try whatever you prefer, just to ensure it won't happen again. And I guess since it's resolved in 7.1-9, it won't.
 
Spam quarantine is 31 days and has not been changed since I actually don't remember.
Hm, from a quick look at the DB dump, I think that most of the mails should have been dealt with (they have been delivered or deleted).
Could you check as Administrator in the GUI -> Spam Quarantine whether there are still mails visible there?

If not, the simplest approach would be to set the quarantine lifetime to 7 days, run `pmgqm purge`, and then reset it to 31 days (I assume your users expect it to stay that way).

And I guess since it's resolved in 7.1-9 it wont.

I would assume so. If not, don't hesitate to post here :)
 
No, it's empty on both nodes.

Code:
root@mail-gw01:~> pmgqm purge
purging database
removed 142 spam quarantine files

Code:
root@mail-gw02:~> pmgqm purge
purging database
removed 118 spam quarantine files

And now we're back in sync.
Code:
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
mail-gw02(1)         10.4.0.2        node   A   10 days 22:35   0.29    37%    22%
mail-gw01(2)         10.2.0.22       master A   10 days 22:35   0.00    38%    22%

And the log looks good:
Code:
Nov 23 17:33:14 mail-gw02 pmgmirror[870138]: database sync 'mail-gw01' failed - Wide character in subroutine entry at /usr/share/perl5/PMG/DBTools.pm line 1076.
Nov 23 17:33:16 mail-gw02 pmgmirror[870138]: cluster synchronization finished  (1 errors, 3.25 seconds (files 0.00, database 2.49, config 0.76))
Nov 23 17:35:13 mail-gw02 pmgmirror[870138]: starting cluster synchronization
Nov 23 17:35:18 mail-gw02 pmgmirror[870138]: cluster synchronization finished  (0 errors, 5.27 seconds (files 0.64, database 3.86, config 0.76))
Nov 23 17:37:13 mail-gw02 pmgmirror[870138]: starting cluster synchronization
Nov 23 17:37:17 mail-gw02 pmgmirror[870138]: cluster synchronization finished  (0 errors, 4.04 seconds (files 0.45, database 2.78, config 0.80))

Thank you, Stoiko. I really appreciate the effort! Awesome!
 
Glad we figured that out!
Should you run into something similar again, just reply here. Maybe then we'll find the root cause.
 
All good, I'll keep that in mind. Thanks again!

We can still consider this resolved as everything is back to normal.
 
