I'm having an issue with a recent cluster that was built after I moved from my existing ISP over to OVHCloud.
Both servers are running Debian 11 with the Proxmox Mail Gateway packages installed on top.
The servers seem to be running fine, but one of them stops accepting connections on localhost and emails start piling up in the deferred queue:
Code:
Apr 4 13:01:46 swarmx1 postfix/lmtp[17279]: 96DDCA1503: to=<user@domain.com>, relay=none, delay=0.05, delays=0.05/0/0/0, dsn=4.4.1, status=deferred (connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)
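When this happens, nothing seems to be listening on port 10023 any more (as far as I understand PMG, that port is normally served by pmg-smtp-filter, which would match the list of failing services below). This is the quick check I run to confirm it, just as a sketch:
Code:
# is anything still listening on the filter port?
ss -tlnp | grep 10023
# is the filter service itself still running?
systemctl status pmg-smtp-filter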
There are 2 servers in the cluster:
1.) Located in Canada in the OVHCloud
2.) Located in UK in the OVHCloud
The sync between the 2 servers works perfectly fine. The main server that's having the issues sees the following services fail:
Code:
pmg-daily
pmg-smtp-filter
pmg-mirror
pmg-policy
pmg-tunnel
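To figure out why they stop, I pull the last journal entries for those units right after a failure and then bring them back up; something like this (the exact unit list and time window are just what I've been using):
Code:
# last hour of logs for the units that died
journalctl -u pmg-smtp-filter -u pmg-policy -u pmg-mirror -u pmg-tunnel --since "1 hour ago" --no-pager
# restart them once the cause is captured
systemctl restart pmg-smtp-filter pmg-policy pmg-mirror pmg-tunnel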
The PMG servers are specced as follows:
2 virtual cores with 4 GB of RAM each.
I was thinking it might be the RAM, since I had this issue previously when the units didn't have enough memory. However, when I watch free while this is happening, it doesn't look like a memory issue as far as I can tell:
======================================================================================
Code:
free
total used free shared buff/cache available
Mem: 3922276 2395740 1152428 22104 374108 1287480
Swap: 0 0 0
======================================================================================
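One thing that does stand out in that output is that there is no swap configured at all (Swap: 0). Given the OOM kills further down, I'm considering adding a small swap file as a stop-gap so a short spike doesn't immediately get processes killed; roughly along these lines (the 2G size is just my guess):
Code:
# create and enable a 2G swap file
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# make it persistent across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab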
I am seeing this when I run the following:
Code:
dmesg -T | egrep -i 'killed process'
[Mon Apr 4 16:01:46 2022] Out of memory: Killed process 48643 (pmgqm) total-vm:142832kB, anon-rss:73712kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:264kB oom_score_adj:0
[Mon Apr 4 16:01:49 2022] Out of memory: Killed process 48631 (pmgqm) total-vm:142988kB, anon-rss:73820kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:264kB oom_score_adj:0
[Mon Apr 4 16:01:51 2022] Out of memory: Killed process 48626 (pmgqm) total-vm:149976kB, anon-rss:74848kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:280kB oom_score_adj:0
[Mon Apr 4 16:01:54 2022] Out of memory: Killed process 48607 (pmgqm) total-vm:154556kB, anon-rss:77076kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:288kB oom_score_adj:0
[Mon Apr 4 17:01:11 2022] Out of memory: Killed process 50552 (clamd) total-vm:1887432kB, anon-rss:1253688kB, file-rss:0kB, shmem-rss:0kB, UID:110 pgtables:2760kB oom_score_adj:0
^^ This error doesn't appear on my other PMG server. It also only seems to happen when the server is at its busiest receiving and sending email.
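The clamd kill in particular makes me wonder about signature reloads: as far as I understand, ClamAV 0.103 reloads its database concurrently by default, which temporarily roughly doubles clamd's memory footprint, and on a 4 GB box that spike could be what tips it over. Disabling that is something I'm thinking of trying (this is my own assumption, not something from the PMG docs):
Code:
# /etc/clamav/clamd.conf
# reload the signature DB in place instead of concurrently,
# trading a short scan pause for a much lower peak memory
ConcurrentDatabaseReload no
and then restart clamd with systemctl restart clamav-daemon.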
The Virus Detector options are set to the following:
Block Encrypted archives and documents: NO
Max recursion: 5
Max files: 1000
Max file size: 25000000
Max scan size: 100000000
Max credit card numbers: 0
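For completeness, this is how I pull those limits on the CLI so I can diff them between the two nodes (assuming I have the API path right):
Code:
# dump the virus detector options via the PMG API shell
pmgsh get /config/clamav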
Code:
pmgversion -v
proxmox-mailgateway: 7.1-1 (API: 7.1-2/75d043b3, running kernel: 5.13.19-6-pve)
pmg-api: 7.1-2
pmg-gui: 3.1-2
pve-kernel-helper: 7.1-13
pve-kernel-5.13: 7.1-9
pve-kernel-5.13.19-6-pve: 5.13.19-14
clamav-daemon: 0.103.5+dfsg-0+deb11u1
ifupdown: 0.8.36
ifupdown2: residual config
libarchive-perl: 3.4.0-1
libjs-extjs: 7.0.0-1
libjs-framework7: 4.4.7-1
libproxmox-acme-perl: 1.4.1
libproxmox-acme-plugins: 1.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-5
libpve-http-server-perl: 4.1-1
libxdgmime-perl: 1.0-1
lvm2: not correctly installed
pmg-docs: 7.1-2
pmg-i18n: 2.6-2
pmg-log-tracker: 2.3.1-1
postgresql-13: 13.5-0+deb11u1
proxmox-mini-journalreader: 1.3-1
proxmox-spamassassin: 3.4.6-4
proxmox-widget-toolkit: 3.4-7
pve-firmware: 3.3-6
pve-xtermjs: 4.16.0-1
Code:
pmgcm status
NAME(CID)--------------IPADDRESS----ROLE-STATE---------UPTIME---LOAD----MEM---DISK
swarmx2(2) 54.36.163.110 node A 1 day 23:23 0.29 34% 5%
swarmx1(1) 51.79.49.82 master A 02:33 0.33 32% 5%
^^ Obviously, when pmg-mirror dies this changes back to "syncing" in the cluster section.
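When that happens I currently just restart the mirror on the master and watch the cluster state come back, along the lines of:
Code:
systemctl restart pmg-mirror
watch -n 5 pmgcm status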
The KVM VPS reports the following (the primary and secondary servers are identical in hardware):
======================================================================================
Code:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel Core Processor (Haswell, no TSX)
Stepping: 1
CPU MHz: 2399.998
BogoMIPS: 4799.99
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
======================================================================================
This issue doesn't appear on the secondary PMG server in the cluster, only on the primary one. I'm not sure if it's a case of too much load; I'd be happy to add more RAM, but I'm not sure that's actually the issue at play here.
I'm also happy to provide any logs / updates required so that I can get this issue nailed down.
Currently, RAM usage is showing as 64% in use, so not excessive.
The load average on the units is also quite low (0.26).
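Since the kills only seem to happen at peak, and free / the GUI only give me a point-in-time number, I've started logging memory once a minute so I can see what it looks like right before a kill (just a quick-and-dirty loop, the log path is arbitrary):
Code:
# log overall memory plus the top consumers every minute
while true; do
    date >> /root/mem.log
    free -m >> /root/mem.log
    ps -eo rss,comm --sort=-rss | head -n 6 >> /root/mem.log
    sleep 60
done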
Thoughts?