[SOLVED] Cluster unable to sync both ways since 11nov

c0urier · Nov 22, 2022

I've seen some other threads and am not sure whether they're related, but I'm trying to figure out how to get my cluster to sync both ways again. It has not worked since 2022-11-11 at 07:28.
The error I see is:

Code:

Nov 11 07:28:00 mail-gw02 pmgmirror[1017497]: cluster synchronization finished  (0 errors, 3.99 seconds (files 0.39, database 2.86, config 0.74))
Nov 11 07:29:56 mail-gw02 pmgmirror[1017497]: starting cluster synchronization
Nov 11 07:29:57 mail-gw02 pmgmirror[1017497]: database sync 'mail-gw01' failed - Wide character in subroutine entry at /usr/share/perl5/PMG/DBTools.pm line 1093.
Nov 11 07:29:59 mail-gw02 pmgmirror[1017497]: cluster synchronization finished  (1 errors, 3.27 seconds (files 0.00, database 2.54, config 0.73))
Nov 11 07:31:56 mail-gw02 pmgmirror[1017497]: starting cluster synchronization
Nov 11 07:31:58 mail-gw02 pmgmirror[1017497]: detected rule database changes - starting sync from '10.2.0.22'
Nov 11 07:31:58 mail-gw02 pmgmirror[1017497]: finished rule database sync from host '10.2.0.22'
Nov 11 07:31:58 mail-gw02 pmgmirror[1017497]: database sync 'mail-gw01' failed - Wide character in subroutine entry at /usr/share/perl5/PMG/DBTools.pm line 1093.
Nov 11 07:32:00 mail-gw02 pmgmirror[1017497]: cluster synchronization finished  (1 errors, 3.48 seconds (files 0.00, database 2.73, config 0.75))

And that just continues until this day.
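To pin down the moment the sync first broke, a small filter like the following may help. It is only a sketch: `first_sync_failure` is a made-up helper name, and the pattern is based solely on the log lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the first pmgmirror line on standard input
# that reports a failed database sync, then stop reading.
first_sync_failure() {
    awk '/pmgmirror\[[0-9]+\]: database sync .* failed/ { print; exit }'
}

# Sample input: three lines copied from the log excerpt in this thread.
first_sync_failure <<'EOF'
Nov 11 07:28:00 mail-gw02 pmgmirror[1017497]: cluster synchronization finished  (0 errors, 3.99 seconds (files 0.39, database 2.86, config 0.74))
Nov 11 07:29:56 mail-gw02 pmgmirror[1017497]: starting cluster synchronization
Nov 11 07:29:57 mail-gw02 pmgmirror[1017497]: database sync 'mail-gw01' failed - Wide character in subroutine entry at /usr/share/perl5/PMG/DBTools.pm line 1093.
EOF
```

On a live node the function could be fed from wherever the pmgmirror messages end up, for example `journalctl -u pmgmirror | first_sync_failure`, assuming they are in the journal under that unit name.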

From mail-gw02 to mail-gw01 everything is fine.

Code:

root@mail-gw02:~> pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
mail-gw01(2)         10.2.0.22       master A    9 days 21:25   0.18    38%    21%
mail-gw02(1)         10.4.0.2        node   S    9 days 21:25   0.16    38%    22%

Code:

root@mail-gw02:~> pmgversion  -v
proxmox-mailgateway: 7.1-2 (API: 7.1-9/e0c0be55, running kernel: 5.15.64-1-pve)
pmg-api: 7.1-9
pmg-gui: 3.1-6
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
clamav-daemon: 0.103.7+dfsg-0+deb11u1
ifupdown: 0.8.36+pve2
libarchive-perl: 3.4.0-1
libjs-extjs: 7.0.0-1
libjs-framework7: 4.4.7-1
libproxmox-acme-perl: 1.4.2
libproxmox-acme-plugins: 1.4.2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-6
libpve-http-server-perl: 4.1-5
libxdgmime-perl: 1.0-1
lvm2: 2.03.11-2.1
pmg-docs: 7.1-2
pmg-i18n: 2.7-2
pmg-log-tracker: 2.3.1-1
postgresql-13: 13.8-0+deb11u1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-spamassassin: 3.4.6-4
proxmox-widget-toolkit: 3.5.1
pve-firmware: 3.5-6
pve-xtermjs: 4.16.0-1
zfsutils-linux: 2.1.6-pve1

Any help to get the cluster back in sync would be greatly appreciated.

Stoiko Ivanov · Nov 22, 2022

c0urier said:
I've seen some other threads and not sure if it's related - But I'm trying to figure out how to get my cluster to sync both ways again since it has not worked since 2022.11.11 at 07.28.

Could you share the logs from that timeframe (from 10 minutes before until 10 minutes after)?
That might help to narrow down where the issue is.

The error message is odd, as it points to a location in the code where this error should not happen. Could you share line 1093 (and the surrounding lines) of /usr/share/perl5/PMG/DBTools.pm?

Did you change anything in the database manually (i.e. not using the PMG tooling/GUI)?
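One way to pull out a numbered window around a given line is sketched below. `show_context` is a hypothetical helper, and the radius of 10 lines is an arbitrary choice, not something requested in the thread.

```shell
#!/bin/sh
# show_context FILE LINE [RADIUS]: print LINE and RADIUS lines on either
# side of it, each prefixed with its line number (RADIUS defaults to 10).
show_context() {
    file=$1 line=$2 radius=${3:-10}
    start=$((line - radius))
    if [ "$start" -lt 1 ]; then start=1; fi
    end=$((line + radius))
    awk -v s="$start" -v e="$end" \
        'NR >= s && NR <= e { printf "%d: %s\n", NR, $0 } NR > e { exit }' "$file"
}

# Only runs where the file from the error message actually exists.
f=/usr/share/perl5/PMG/DBTools.pm
if [ -r "$f" ]; then
    show_context "$f" 1093 10
fi
```

Keeping the line numbers in the output makes it unambiguous which statement the error message refers to.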

c0urier · Nov 23, 2022

Hi Stoiko

Thanks for reaching out. The logs from mail-gw01 and mail-gw02 are attached, as there are a lot of them.

DBTools.pm - mail-gw01

Code:

sub update_master_clusterinfo {
    my ($clientcid) = @_;

    my $dbh = open_ruledb();

    $dbh->do("DELETE FROM ClusterInfo WHERE CID = $clientcid");

    my @mt = ('CMSReceivers', 'CGreylist', 'UserPrefs', 'DomainStat', 'DailyStat', 'LocalStat', 'VirusInfo');

    foreach my $table (@mt) {
        $dbh->do ("INSERT INTO ClusterInfo (cid, name, ivalue) select $clientcid, 'lastmt_$table', " .
                  "EXTRACT(EPOCH FROM now())");
    }
}

DBTools.pm - mail-gw02

Code:

sub update_master_clusterinfo {
    my ($clientcid) = @_;

    my $dbh = open_ruledb();

    $dbh->do("DELETE FROM ClusterInfo WHERE CID = $clientcid");

    my @mt = ('CMSReceivers', 'CGreylist', 'UserPrefs', 'DomainStat', 'DailyStat', 'LocalStat', 'VirusInfo');

    foreach my $table (@mt) {
        $dbh->do ("INSERT INTO ClusterInfo (cid, name, ivalue) select $clientcid, 'lastmt_$table', " .
                  "EXTRACT(EPOCH FROM now())");
    }
}

Did you change anything in the database manually? (i.e. not using the PMG tooling/GUI)?

No, I have not made any manual changes to the DB.

Again thanks for assisting.

Stoiko Ivanov · Nov 23, 2022

Thanks for the information!

I think it's related to the following mail-id: 2078C636DEBD107785
and probably related to:
https://forum.proxmox.com/threads/mail-flow-broken-on-7-1-8-patch-on-specific-case.117796/

* could you maybe share a (redacted) output of your `pmgdb dump` (this is the ruleset)
* do you have any 'modify field' or 'notify' actions?

I'll try to reproduce the issue with the versions you listed above.

c0urier · Nov 23, 2022

Hi Stoiko

Thanks, I appreciate you taking the time.

They're attached. The only things redacted are the blacklist and whitelist; everything else should be generic.

The 'modify field' and 'notify' actions are the ones that came with PMG, and they have not been modified since the nodes were installed quite some years ago.

Stoiko Ivanov · Nov 23, 2022

Thanks!

Do you by any chance know when you installed the upgrades?
(Or could you share /var/log/apt/history.log, or the rotated variant that captures the timeframe before the issue occurred?)

c0urier · Nov 23, 2022

They're pretty much identical.

mail-gw01:

Code:

Start-Date: 2022-11-09  13:19:07
Commandline: apt dist-upgrade -y
Upgrade: pmg-api:amd64 (7.1-7, 7.1-8), pmg-gui:amd64 (3.1-4, 3.1-5)
End-Date: 2022-11-09  13:19:22

Start-Date: 2022-11-12  18:56:04
Commandline: apt dist-upgrade -y
Upgrade: libpixman-1-0:amd64 (0.40.0-1, 0.40.0-1.1~deb11u1)
End-Date: 2022-11-12  18:56:04

Start-Date: 2022-11-15  23:07:28
Commandline: apt dist-upgrade -y
Upgrade: grub-pc-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), pmg-api:amd64 (7.1-8, 7.1-9), grub-efi-amd64-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub2-common:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), libpve-http-server-perl:amd64 (4.1-4, 4.1-5), libpve-common-perl:amd64 (7.2-3, 7.2-5), grub-common:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub-efi-ia32-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub-pc:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4)
End-Date: 2022-11-15  23:07:46

mail-gw02:

Code:

Start-Date: 2022-11-09  13:19:15
Commandline: apt dist-upgrade -y
Upgrade: pmg-api:amd64 (7.1-7, 7.1-8), pmg-gui:amd64 (3.1-4, 3.1-5)
End-Date: 2022-11-09  13:20:02

Start-Date: 2022-11-12  18:55:56
Commandline: apt dist-upgrade -y
Upgrade: libpixman-1-0:amd64 (0.40.0-1, 0.40.0-1.1~deb11u1)
End-Date: 2022-11-12  18:55:57

Start-Date: 2022-11-15  23:07:55
Commandline: apt dist-upgrade -y
Upgrade: grub-pc-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), pmg-api:amd64 (7.1-8, 7.1-9), grub-efi-amd64-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub2-common:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), libpve-http-server-perl:amd64 (4.1-4, 4.1-5), libpve-common-perl:amd64 (7.2-3, 7.2-5), grub-common:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub-efi-ia32-bin:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4), grub-pc:amd64 (2.06-3~deb11u2, 2.06-3~deb11u4)
End-Date: 2022-11-15  23:09:05

The issue first showed up on the 11th, so in between the 7.1-7 -> 7.1-8 upgrade (2022-11-09) and the 7.1-8 -> 7.1-9 upgrade (2022-11-15).
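The same timeline can be read straight out of the apt history with a short filter. This is a sketch under the assumption that the stanzas look like the ones pasted above; `pmg_api_upgrades` is a made-up name, and the second sample stanza is abridged to two packages.

```shell
#!/bin/sh
# Hypothetical helper: read apt history stanzas on standard input and print
# "DATE OLD -> NEW" for every upgrade that touched pmg-api.
pmg_api_upgrades() {
    awk '
        /^Start-Date:/ { date = $2 }
        /^Upgrade:/ {
            if (match($0, /pmg-api:[a-z0-9]+ \([^)]*\)/)) {
                entry = substr($0, RSTART, RLENGTH)
                sub(/^[^(]*\(/, "", entry)    # drop the package name and "("
                sub(/\)$/, "", entry)          # drop the closing ")"
                sub(/, /, " -> ", entry)       # "old, new" becomes "old -> new"
                print date, entry
            }
        }
    '
}

pmg_api_upgrades <<'EOF'
Start-Date: 2022-11-09  13:19:07
Commandline: apt dist-upgrade -y
Upgrade: pmg-api:amd64 (7.1-7, 7.1-8), pmg-gui:amd64 (3.1-4, 3.1-5)
End-Date: 2022-11-09  13:19:22

Start-Date: 2022-11-15  23:07:28
Commandline: apt dist-upgrade -y
Upgrade: pmg-api:amd64 (7.1-8, 7.1-9), libpve-http-server-perl:amd64 (4.1-4, 4.1-5)
End-Date: 2022-11-15  23:07:46
EOF
# prints:
# 2022-11-09 7.1-7 -> 7.1-8
# 2022-11-15 7.1-8 -> 7.1-9
```

On a node, the input would be /var/log/apt/history.log instead of the inline sample.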

Stoiko Ivanov · Nov 23, 2022

OK, that helps. So the issue was introduced with 7.1-8!

Thanks!

Stoiko Ivanov · Nov 23, 2022

Sadly, I did not manage to reproduce the issue here, despite trying to match the potentially problematic mail as closely as possible.

The following commands should produce a text listing of the database tables that might be causing the issue:

Code:

psql Proxmox_ruledb --echo-queries -c "select * from clusterinfo" >> dbdump.txt
psql Proxmox_ruledb --echo-queries -c "select * from cmsreceivers" >> dbdump.txt
for i in cmailstore cstatistic domainstat dailystat; do
    psql Proxmox_ruledb --echo-queries -c "select * from $i where time > 1668121200" >> dbdump.txt
done

If you like, please run them, gzip the resulting dbdump.txt and share it (if you prefer, send it via mail to s.ivanov _at_ proxmox.com).
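The cutoff 1668121200 in those queries is a Unix timestamp. A quick way to check what it stands for, or to build a cutoff for a different moment, is sketched below. It assumes GNU `date`; the two function names are made up for illustration.

```shell
#!/bin/sh
# epoch_to_utc EPOCH: print an epoch value as a UTC timestamp (GNU date).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# to_epoch TIMESTAMP: print a timestamp as epoch seconds. Giving an explicit
# UTC offset keeps the result independent of the local time zone.
to_epoch() {
    date -d "$1" +%s
}

epoch_to_utc 1668121200                  # 2022-11-10 23:00:00 UTC
to_epoch '2022-11-11 00:00:00 +0100'     # 1668121200
```

So the queries select rows from midnight on 2022-11-11 at UTC+1 onwards, which covers the first failed sync at 07:29 that day.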

Thanks

c0urier · Nov 23, 2022

I've sent an email with a dump from both nodes. Hopefully it will help.

Stoiko Ivanov · Nov 23, 2022

Thanks - I think the issue is with quarantined mails.

How long is your quarantine lifetime? (GUI->Configuration->Spam Detector->Quarantine)

I still wonder how the mails ended up in quarantine, since here the mail is simply dropped (just like in the linked thread) and therefore does not break the cluster sync.

Do you have an LDAP profile configured?

In any case, it depends on how long you want to (or can) wait:
* the issue should resolve itself once the quarantine lifetime plus one day has passed since you installed pmg-api 7.1-9 (2022-11-15), because by then the pmgspamreport timer will have purged all problematic mails
* otherwise we could try to selectively remove old mails from the quarantine DB and from the spool directory, but this is a bit involved and could cause inconsistencies
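For the first option, the waiting time is simple date arithmetic. The sketch below assumes GNU `date`; `purge_deadline` is a made-up name, and the 31 days in the example is the lifetime reported later in this thread.

```shell
#!/bin/sh
# purge_deadline INSTALL_DATE LIFETIME_DAYS: print the date that lies
# LIFETIME_DAYS + 1 days after INSTALL_DATE (computed in UTC, GNU date).
purge_deadline() {
    start=$(date -u -d "$1 00:00:00" +%s) || return 1
    date -u -d "@$((start + ($2 + 1) * 86400))" +%F
}

purge_deadline 2022-11-15 31    # 2022-12-17
```

Working in epoch seconds and UTC avoids surprises from daylight saving changes in the local time zone.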

c0urier · Nov 23, 2022

The spam quarantine lifetime is 31 days and has not been changed for as long as I can remember. And I agree, it seems a bit odd.

No LDAP profiles configured.

So if I changed the quarantine lifetime to, e.g., 7 days, it should resolve itself?

Either way, I am willing to try whatever you prefer, just to ensure it won't happen again. And I guess since it's resolved in 7.1-9 it won't.

Stoiko Ivanov · Nov 23, 2022

c0urier said:
Spam quarantine is 31 days and has not been changed since I actually don't remember.

Hm, from a quick look at the DB dump I think most of the mails should already have been dealt with (they have been delivered or deleted).
Could you check as Administrator in the GUI -> Spam Quarantine whether any mails are still visible there?

If not, the simplest approach would really be to set the quarantine lifetime to 7 days, run `pmgqm purge`, and then reset it to 31 days (I assume your users expect it to stay that way).

And I guess since it's resolved in 7.1-9 it won't.

I would assume so. If not, don't hesitate to post here.

c0urier · Nov 23, 2022

No, it's empty on both nodes.

Code:

root@mail-gw01:~> pmgqm purge
purging database
removed 142 spam quarantine files

Code:

root@mail-gw02:~> pmgqm purge
purging database
removed 118 spam quarantine files

And now we're back in sync.

Code:

NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
mail-gw02(1)         10.4.0.2        node   A   10 days 22:35   0.29    37%    22%
mail-gw01(2)         10.2.0.22       master A   10 days 22:35   0.00    38%    22%

And the log looks good:

Code:

Nov 23 17:33:14 mail-gw02 pmgmirror[870138]: database sync 'mail-gw01' failed - Wide character in subroutine entry at /usr/share/perl5/PMG/DBTools.pm line 1076.
Nov 23 17:33:16 mail-gw02 pmgmirror[870138]: cluster synchronization finished  (1 errors, 3.25 seconds (files 0.00, database 2.49, config 0.76))
Nov 23 17:35:13 mail-gw02 pmgmirror[870138]: starting cluster synchronization
Nov 23 17:35:18 mail-gw02 pmgmirror[870138]: cluster synchronization finished  (0 errors, 5.27 seconds (files 0.64, database 3.86, config 0.76))
Nov 23 17:37:13 mail-gw02 pmgmirror[870138]: starting cluster synchronization
Nov 23 17:37:17 mail-gw02 pmgmirror[870138]: cluster synchronization finished  (0 errors, 4.04 seconds (files 0.45, database 2.78, config 0.80))

Thank you, Stoiko - Really appreciated the effort! Awesome!

Stoiko Ivanov · Nov 23, 2022

Glad we figured that out!
Should you run into something similar again, just reply here. Maybe then we'll find the root cause.

c0urier · Nov 23, 2022

All good, I'll keep that in mind. Again, thanks!

We can still consider this resolved as everything is back to normal.
