PMG Server going deaf

dthompson

Having an issue with a recent cluster that was built after I moved from my existing ISP over to OVHCloud.
Both servers are running Debian 11 with the Proxmox packages installed on top.

The servers seem to be running fine, but one server stops accepting connections on localhost and deferred emails start piling up:

Apr 4 13:01:46 swarmx1 postfix/lmtp[17279]: 96DDCA1503: to=<user@domain.com>, relay=none, delay=0.05, delays=0.05/0/0/0, dsn=4.4.1, status=deferred (connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)
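
As far as I understand, 127.0.0.1:10023 is where pmg-smtp-filter accepts mail from Postfix over LMTP, so when this happens nothing is answering on that port. This is roughly what I check (just a sketch):

Code:
# confirm nothing is listening on the filter's LMTP port any more
ss -ltnp | grep 10023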

There are 2 servers in the cluster:
1.) Located in Canada in the OVHCloud
2.) Located in UK in the OVHCloud

The sync between the 2 servers works perfectly fine. The main server that's having issues has the following services fail (a quick status/restart sketch follows the list):

Code:
pmg-daily
pmg-smtp-filter
pmg-mirror
pmg-policy
pmg-tunnel
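
When they die, this is roughly how I spot and restart them until the next hiccup (sketch only; the exact systemd unit names may differ slightly from the labels above, hence the glob):

Code:
# list what systemd considers failed
systemctl --failed
# restart all PMG units in one go
systemctl restart 'pmg*'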

The PMG servers are as follows:
2 virtual cores with 4GB RAM.

I was thinking it might be RAM, since I had this issue before with not enough RAM for the units. However, when I watch free, it doesn't look like a memory issue as far as I can tell:

======================================================================================
Code:
free
               total        used        free      shared  buff/cache   available
Mem:         3922276     2395740     1152428       22104      374108     1287480
Swap:              0           0           0

======================================================================================
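
One thing that stands out is that the VPS has no swap at all, so any short memory spike goes straight to the OOM killer. As a stopgap I could add a small swapfile (standard Debian recipe, nothing PMG-specific):

Code:
# create and enable a 2 GB swapfile to buffer short memory spikes
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# persist across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab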

I am seeing this when I run the following:
Code:
dmesg -T | egrep -i 'killed process'
[Mon Apr  4 16:01:46 2022] Out of memory: Killed process 48643 (pmgqm) total-vm:142832kB, anon-rss:73712kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:264kB oom_score_adj:0
[Mon Apr  4 16:01:49 2022] Out of memory: Killed process 48631 (pmgqm) total-vm:142988kB, anon-rss:73820kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:264kB oom_score_adj:0
[Mon Apr  4 16:01:51 2022] Out of memory: Killed process 48626 (pmgqm) total-vm:149976kB, anon-rss:74848kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:280kB oom_score_adj:0
[Mon Apr  4 16:01:54 2022] Out of memory: Killed process 48607 (pmgqm) total-vm:154556kB, anon-rss:77076kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:288kB oom_score_adj:0
[Mon Apr  4 17:01:11 2022] Out of memory: Killed process 50552 (clamd) total-vm:1887432kB, anon-rss:1253688kB, file-rss:0kB, shmem-rss:0kB, UID:110 pgtables:2760kB oom_score_adj:0

^^ This error doesn't appear on my other PMG server. It also only seems to happen when the server is busiest receiving and sending email.
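
Since free is only a snapshot, when the server gets busy I also grab the top memory consumers at that moment (rough check):

Code:
# top 10 processes by resident memory
ps aux --sort=-rss | head -n 11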

Virus Detector is set to the following:
Block encrypted archives and documents: No
Max recursion: 5
Max files: 1000
Max file size: 25000000
Max scan size: 100000000
Max credit card numbers: 0
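
Since clamd is one of the processes getting killed, I also checked how big the signature databases are, as clamd keeps them fully loaded in RAM (path assumed to be the Debian default):

Code:
# size on disk of the signature databases clamd loads into memory
du -sh /var/lib/clamav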

pmgversion -v
proxmox-mailgateway: 7.1-1 (API: 7.1-2/75d043b3, running kernel: 5.13.19-6-pve)
pmg-api: 7.1-2
pmg-gui: 3.1-2
pve-kernel-helper: 7.1-13
pve-kernel-5.13: 7.1-9
pve-kernel-5.13.19-6-pve: 5.13.19-14
clamav-daemon: 0.103.5+dfsg-0+deb11u1
ifupdown: 0.8.36
ifupdown2: residual config
libarchive-perl: 3.4.0-1
libjs-extjs: 7.0.0-1
libjs-framework7: 4.4.7-1
libproxmox-acme-perl: 1.4.1
libproxmox-acme-plugins: 1.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-5
libpve-http-server-perl: 4.1-1
libxdgmime-perl: 1.0-1
lvm2: not correctly installed
pmg-docs: 7.1-2
pmg-i18n: 2.6-2
pmg-log-tracker: 2.3.1-1
postgresql-13: 13.5-0+deb11u1
proxmox-mini-journalreader: 1.3-1
proxmox-spamassassin: 3.4.6-4
proxmox-widget-toolkit: 3.4-7
pve-firmware: 3.3-6
pve-xtermjs: 4.16.0-1

Code:
pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
swarmx2(2)         54.36.163.110 node   A        1 day 23:23   0.29    34%     5%
swarmx1(1)           51.79.49.82 master A              02:33   0.33    32%     5%

^^ Obviously, when pmg-mirror dies this changes back to "syncing" in the cluster section.

KVM on the VPS shows as follows (the primary and secondary server are identical in hardware):
======================================================================================
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel Core Processor (Haswell, no TSX)
Stepping: 1
CPU MHz: 2399.998
BogoMIPS: 4799.99
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
======================================================================================


This issue doesn't appear on the secondary PMG server in the cluster, only on the primary one. I'm not sure if it's a case of too much load. I'd be happy to add more RAM, but I'm not sure that's the issue at play here.

I'm also happy to provide any logs / updates required so that I can get this issue nailed down.

Currently RAM usage shows as 64% in use, so not excessive.
The load average on the units is also quite low (0.26).

Thoughts?
 
The PMG servers are as follows:
2 virtual cores with 4GB RAM.
4 GB might be a bit tight
[Mon Apr 4 17:01:11 2022] Out of memory: Killed process 50552 (clamd)
especially clamav is quite a memory hog
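
one more thing worth checking - since ClamAV 0.103 the database reload happens concurrently by default, which roughly doubles clamd's memory for a short time. if the OOM kills line up with signature reloads, disabling that trades a brief scan pause for a much lower peak (just a sketch - PMG may regenerate clamd.conf from a template, so the change might need to go there instead):

Code:
# fall back to blocking (lower-memory) signature reloads
echo 'ConcurrentDatabaseReload no' >> /etc/clamav/clamd.conf
systemctl restart clamav-daemon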

if it's easily possible could you increase the memory (to 6 GB)?

else - do you have any modifications to the PMG setup (more clamav signatures, other services installed in parallel to the PMG packages)?

if increasing the memory does not fix the issue could you provide the journal since booting until the first OOM messages appear?
(journalctl -b)
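
for example something like this (just a sketch - adjust the path as you like):

Code:
# dump the full journal since boot into a file you can attach here
journalctl -b --no-pager > /tmp/pmg-journal.txt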

I hope this helps!
 
So it might be the RAM, you're right. I see that when things go off, clamd cranks up to 99% CPU usage and memory gets hit hard.

However, that being said, I think I've found the issue on my end. On my old server, I had pmgqm set to send emails to certain users every hour. For one domain, the entry was never set properly. It was as follows:

1 8-20 * * 1-7 /usr/bin/pmgqm send --receivernathan &>/dev/null

With my old server, because I could easily add more RAM and CPU resources, I think cron and the pmgqm command just hammered through the error and kept rolling. Now that it's on a lower-RAM server, it's harder for it to overcome this. As soon as I fixed the entry, the emails went through on the following hour. CPU and RAM still went up, but nothing stopped or stalled. This is great.

1 8-20 * * 1-7 /usr/bin/pmgqm send --receiver nathan@domain.com &>/dev/null

I'll keep monitoring this, but I think I was able to resolve the error. I still might up the RAM on both servers, just for peace of mind for future updates to the Proxmox services and to ensure things keep running smoothly.

That being said, with the command I run, is there a better way to do this for all users on the system, or for a particular domain? The domain that was causing me issues has about 30 users, and each one is set up as above, so it basically killed the server when it couldn't deliver the spam notifications.
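
From what I can see in the pmgqm man page, leaving out --receiver should generate reports for all users, so a single entry like this might replace the 30 per-user lines (untested sketch on my side):

Code:
# one hourly entry instead of one per user; pmgqm should cover everyone with quarantined mail
1 8-20 * * 1-7 /usr/bin/pmgqm send &>/dev/null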

If there is a better way to do this in version 7 of PMG, I'd love to hear a best practices way to accomplish this.

Thanks for the help!!
 
