MX cluster getting timeouts

plnt

Member
Jan 20, 2022
Hi guys,

We have a long-standing problem with our PMG 8.2.0 MX cluster: MX1 is hitting constant timeouts, and not only MX1 but other nodes in the cluster as well.
The website is slow and returns timeout error messages. API calls that set ANTISPAM on domains, or that add and remove domains from transport, also end in timeouts. Adding anything via the API is a gamble.
I wanted to increase the worker pool of the proxy service (pmgproxy) on the MX servers, but I couldn't manage it. It only has 3 workers, which seems low to me.
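
As a quick check, the number of running worker processes can be counted like this (a sketch; the "worker" label in the process title is an assumption carried over from the related pveproxy daemon):
Bash:
# count pmgproxy worker processes by their process title
ps -C pmgproxy -o cmd= | grep -c worker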

Example error from the API:

Code:
[Thu Jul 10 08:03:03.887018 2025] [proxy_fcgi:error] [pid 2710157:tid 2710167] [remote 194.1.216.65:56741] AH01071: Got error 'PHP message: PHP Fatal error: Uncaught PVE2_Exception: Not logged into Proxmox host. No Login access ticket found or ticket expired. in /var/www/clients/client21/web65/web/class/pmg.php:147\nStack trace:\n#0 /var/www/clients/client21/web65/web/class/pmg.php(543): PMGService->action()\n#1 /var/www/clients/client21/web65/web/email_content_antispam.php(74): PMGService->get()\n#2 {main}\n thrown in /var/www/clients/client21/web65/web/class/pmg.php on line 147', referer: https://ourwebsite.sk/hosting-email

[Thu Jul 10 08:38:58.813208 2025] [proxy_fcgi:error] [pid 2710100:tid 2710108] [remote 195.160.182.66:51563] AH01071: Got error 'PHP message: PHP Fatal error: Uncaught PVE2_Exception: Not logged into Proxmox host. No Login access ticket found or ticket expired. in /var/www/clients/client21/web65/web/class/pmg.php:147\nStack trace:\n#0 /var/www/clients/client21/web65/web/class/pmg.php(543): PMGService->action()\n#1 /var/www/clients/client21/web65/web/email_content_antispam.php(74): PMGService->get()\n#2 {main}\n thrown in /var/www/clients/client21/web65/web/class/pmg.php on line 147', referer: https://ourwebsite.sk/hosting-email

[Thu Jul 10 08:38:58.817329 2025] [proxy_fcgi:error] [pid 1733339:tid 1733347] [remote 176.116.114.29:14378] AH01071: Got error 'PHP message: PHP Fatal error: Uncaught PVE2_Exception: Not logged into Proxmox host. No Login access ticket found or ticket expired. in /var/www/clients/client21/web65/web/class/pmg.php:147\nStack trace:\n#0 /var/www/clients/client21/web65/web/class/pmg.php(543): PMGService->action()\n#1 /var/www/clients/client21/web65/web/email_content_antispam.php(74): PMGService->get()\n#2 {main}\n thrown in /var/www/clients/client21/web65/web/class/pmg.php on line 147', referer: https://ourwebsite.sk/hosting-email

[Thu Jul 10 08:39:28.081610 2025] [proxy_fcgi:error] [pid 1155067:tid 1155147] [remote 176.116.114.29:14438] AH01071: Got error 'PHP message: PHP Fatal error: Uncaught PVE2_Exception: Not logged into Proxmox host. No Login access ticket found or ticket expired. in /var/www/clients/client21/web65/web/class/pmg.php:147\nStack trace:\n#0 /var/www/clients/client21/web65/web/class/pmg.php(543): PMGService->action()\n#1 /var/www/clients/client21/web65/web/email_content_antispam.php(74): PMGService->get()\n#2 {main}\n thrown in /var/www/clients/client21/web65/web/class/pmg.php on line 147', referer: https://ourwebsite.sk/hosting-email

[Thu Jul 10 08:41:15.807996 2025] [proxy_fcgi:error] [pid 2710100:tid 2710107] [remote 195.160.182.66:51563] AH01071: Got error 'PHP message: PHP Fatal error: Uncaught PVE2_Exception: Not logged into Proxmox host. No Login access ticket found or ticket expired. in /var/www/clients/client21/web65/web/class/pmg.php:147\nStack trace:\n#0 /var/www/clients/client21/web65/web/class/pmg.php(543): PMGService->action()\n#1 /var/www/clients/client21/web65/web/email_content_antispam.php(74): PMGService->get()\n#2 {main}\n thrown in /var/www/clients/client21/web65/web/class/pmg.php on line 147', referer: https://ourwebsite.sk/hosting-email
 

Attachments

  • Screenshot 2025-07-15 at 9.14.00 AM.png (72.3 KB)
  • Screenshot 2025-07-15 at 9.17.55 AM.png (10.4 KB)
  • Screenshot 2025-07-15 at 9.20.39 AM.png (40.1 KB)
  • Screenshot 2025-07-15 at 9.26.56 AM.png (7.5 KB)
  • Screenshot 2025-07-15 at 9.30.36 AM.png (534.2 KB)
The errors you're seeing, like the "Not logged into Proxmox host" issue, indicate authentication problems with your API calls. Have you checked if there's a session timeout or if the authentication tokens are expiring too quickly?


Regarding your proxy worker pool configuration, increasing the number of workers can indeed help with handling more concurrent requests. What steps did you take when you tried to increase the pool, and what issues did you encounter?


Also, for your MX cluster timeouts, have you reviewed the system logs to see if there are any resource constraints or network issues causing these timeouts across different nodes?
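
To rule out the ticket side, you could request a fresh ticket directly and time the call (a sketch; the username, password, and host are placeholders to replace with real values):
Bash:
# request a fresh API ticket and time it; a slow response here points at the API layer itself
time curl -sk -d 'username=root@pam' -d 'password=SECRET' \
  https://mx1.example:8006/api2/json/access/ticket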
 
We checked the API calls and they seem to be fine; the error message does not correspond to reality. It simply times out, and the web interface does the same.

I tried various ways to raise the number of workers, e.g.:

Bash:
# check the current worker processes
systemctl status pmgproxy | grep worker


# systemd drop-in attempting to raise the limits via environment variables
mkdir /etc/systemd/system/pmgproxy.service.d
vim /etc/systemd/system/pmgproxy.service.d/override.conf

[Service]
Environment="MAX_WORKERS=8" "MAX_CONN=1500" "MAX_REQUESTS=20000"

systemctl daemon-reload
systemctl restart pmgproxy

But without success: sometimes the service did not start, and other times it started but with the default parameters.
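
At least whether systemd picked up the drop-in at all can be confirmed like this (whether pmgproxy then honors those variables is a separate question):
Bash:
systemctl show pmgproxy -p Environment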

I have not seen any resource problems on the hosts, and the same goes for the network across the whole infrastructure.
 
Hi guys,

Can anyone help? I need to solve this. I don't even know if it's possible to run PMG in such a large cluster with such a large number of domains.
 
  • Could you please check the logs to see why it’s timing out:
Code:
journalctl -u pmgproxy -f
grep 'error' /var/log/pmg/pmgproxy/pmgproxy.log

  • If the problem is too many connections, put NGINX in front of PMG to handle more traffic (see the sketch after this list). Don’t try to change the built-in proxy.
  • Only if nothing else works, you can edit PMG/HTTPServer.pm to raise max_workers, but updates will undo your changes.
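
For the NGINX option, a minimal sketch (the server name, and the combined key-plus-certificate file at /etc/pmg/pmg-api.pem, are assumptions to adjust to your setup):
Code:
# minimal TLS reverse proxy in front of pmgproxy on port 8006
server {
    listen 443 ssl;
    server_name mx1.example;

    # pmg-api.pem holds both the key and the certificate, so both directives may point at it
    ssl_certificate     /etc/pmg/pmg-api.pem;
    ssl_certificate_key /etc/pmg/pmg-api.pem;

    location / {
        proxy_pass https://127.0.0.1:8006;
        proxy_http_version 1.1;
        # keep websocket upgrades working for the web UI
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}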
 
I don't have an error log at that path either.
But in
Code:
/var/log/pmgproxy/pmgproxy.log
I have plenty of normal access logs from the API.

Code:
::ffff:172.16.8.9 - setup@pmg [01/10/2025:22:31:28 +0200] "DELETE /api2/json/config/ruledb/who/1992/objects/158924 HTTP/1.1" 200 13

Code:
::ffff:172.16.8.9 - setup@pmg [06/10/2025:06:20:30 +0200] "GET /api2/json/config/transport HTTP/1.1" 200 2396297

And so on.
We do also use HAProxy in front of the MX nodes (for management), but it makes no difference here.

We have a bug there that adds a 3-second lag to every PMG response, and it is certainly not something we will eliminate by adding RAM or CPU. I tested it at length: polling from different servers, at different times, against different MX nodes, even from an MX host to itself, and the response always takes about 10-50 milliseconds PLUS THREE SECONDS. It doesn't matter if it's Sunday at midnight, when nobody is around, or Monday morning, when everyone is working with mail. It is never 2.5 seconds or less, and never 3.5 seconds or more: the response always lands between 3.01 and 3.05 seconds. The 0.01 to 0.05 seconds depend on the current load, but the flat 3 seconds is the bug we need to eliminate. You can run the test below from anywhere (or put 127.0.0.1 instead of mx3.webhouse.sk to test directly on an MX), and look at the value of "time_starttransfer":

Code:
curl -sk -o /dev/null -w ' time_namelookup: %{time_namelookup}s time_connect: %{time_connect}s time_appconnect: %{time_appconnect}s time_starttransfer:%{time_starttransfer}s time_total: %{time_total}s ' https://mx3.webhouse.sk:8006/api2/json/version

This is the result:
Code:
time_namelookup: 0.002022s time_connect: 0.004828s time_appconnect: 0.053946s time_starttransfer:3.058493s time_total: 3.058712s
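
To show the offset is cluster-wide, the same probe can be looped over several nodes (host names other than mx3 are placeholders):
Bash:
# repeat the timing probe against each node; the ~3 s floor shows up everywhere
for h in mx1.webhouse.sk mx2.webhouse.sk mx3.webhouse.sk; do
  printf '%s ' "$h"
  curl -sk -o /dev/null -w 'time_starttransfer: %{time_starttransfer}s\n' \
    "https://$h:8006/api2/json/version"
done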
 
Hi,

any ideas?

At this stage the entire MX cluster is effectively unusable: the web interface does not respond, and mail processing time is also slowing down (around 8 seconds).

I suspect the default settings of the bundled PostgreSQL are wrong for a cluster of this size, and that this is the cause.
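
A quick way to see what the stock install is actually running with (a sketch; on PMG the root user can connect to the local cluster directly):
Bash:
# list every PostgreSQL setting that differs from its built-in default
psql -d postgres -c "SELECT name, setting, unit, source FROM pg_settings WHERE source <> 'default' ORDER BY name;"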
 
I'll answer myself.
IT WOULD BE GOOD IF THE PMG TEAM SHIPPED THIS AS PART OF THE DEFAULT INSTALLATION.

Because shipping a product on a stock PostgreSQL 15 configuration is, frankly, irresponsible.

1. First I checked whether performance was being affected by disk latency.
Bash:
iostat -x 1 10    # extended per-device stats, 1 s interval, 10 samples
pidstat -d 1 10   # per-process disk I/O over the same window

2. Next, inspect the SQL environment. If you see high seq_scan counts, lots of dead rows, or overly frequent checkpoints, it's a configuration issue.
SQL:
psql -d postgres -c "SELECT now(), * FROM pg_stat_activity WHERE state <> 'idle' LIMIT 20;"
psql -d postgres -c "SELECT relname,n_dead_tup,n_live_tup,autovacuum_count FROM pg_stat_all_tables ORDER BY n_dead_tup DESC LIMIT 20;"
psql -d postgres -c "SELECT schemaname,relname,idx_scan,seq_scan FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 20;"
psql -d postgres -c "SELECT * FROM pg_stat_bgwriter\gx"

3. PMG template editing and synchronization
Bash:
mkdir -p /etc/pmg/templates
# a template copied here overrides the shipped one on the next sync
cp /var/lib/pmg/templates/postgresql.conf.in /etc/pmg/templates/
vim /etc/pmg/templates/postgresql.conf.in
pmgconfig sync --restart 1
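
To confirm the sync actually rendered the values into the live config (the /etc/postgresql/15/main path is an assumption for a stock Debian PostgreSQL 15 install):
Bash:
grep -E 'shared_buffers|max_wal_size|autovacuum_naptime' /etc/postgresql/15/main/postgresql.conf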

4. PostgreSQL tuning (for hosts with approx. 16 GB RAM)
Code:
# Memory
shared_buffers = 2GB
effective_cache_size = 6GB
work_mem = 16MB
maintenance_work_mem = 512MB
autovacuum_work_mem = 256MB

# WAL / checkpoint
wal_compression = on
checkpoint_timeout = 15min
max_wal_size = 4GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9

# I/O and parallelism
effective_io_concurrency = 200   # SSD/NVMe
random_page_cost = 1.1           # SSD
max_worker_processes = 8
max_parallel_workers_per_gather = 2
max_parallel_workers = 4

# Autovacuum
autovacuum_naptime = 20s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.02

# Logging/diag
log_min_duration_statement = 500ms
log_checkpoints = on

And then restart the PostgreSQL service on all nodes:

Bash:
HOSTS="mx1 mx2 mx3 mx4 mx5 mx6 mx7 mx8 mx9 mx10 mx11 mx12 mx13 mx14 mx15 mx16 mx17"

for h in $HOSTS; do
  ssh $h "systemctl restart postgresql"
done
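
Then spot-check that the new settings are live on every node (sketch):
Bash:
for h in $HOSTS; do
  echo "== $h =="
  ssh $h 'psql -d postgres -tAc "SHOW shared_buffers;"'
done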

5. Query tuning (pg_stat_statements and indexes)
Bash:
HOSTS="mx1 mx2 mx3 mx4 mx5 mx6 mx7 mx8 mx9 mx10 mx11 mx12 mx13 mx14 mx15 mx16 mx17"
for h in $HOSTS; do
  # note: pg_stat_statements only records data once shared_preload_libraries
  # includes it and PostgreSQL has been restarted
  ssh $h "sudo -u postgres -H psql -d postgres -c \"CREATE EXTENSION IF NOT EXISTS pg_stat_statements;\""
done

HOSTS="mx1 mx2 mx3 mx4 mx5 mx6 mx7 mx8 mx9 mx10 mx11 mx12 mx13 mx14 mx15 mx16 mx17"
for h in $HOSTS; do
  ssh $h '
    psql -d Proxmox_ruledb -v ON_ERROR_STOP=1 -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS object_objectgroup_id_id_idx ON public.object(objectgroup_id,id);" &&
    psql -d Proxmox_ruledb -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS objectgroup_attributes_gid_idx ON public.objectgroup_attributes(objectgroup_id);" &&
    psql -d Proxmox_ruledb -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS rulegroup_ruleid_grp_obj_idx ON public.rulegroup(rule_id,grouptype,objectgroup_id);" &&
    psql -d Proxmox_ruledb -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS rule_attributes_ruleid_name_idx ON public.rule_attributes(rule_id,name);" &&
    psql -d Proxmox_ruledb -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS cgreylist_ipnet_cidr_sr_idx ON public.cgreylist ((IPNet::cidr), Sender, Receiver);" &&
    psql -d Proxmox_ruledb -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS domainstat_mtime_idx ON public.domainstat(mtime);" &&
    psql -d Proxmox_ruledb -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS domainstat_time_domain_idx ON public.domainstat(time,domain);" &&
    psql -d Proxmox_ruledb -c "VACUUM (ANALYZE) public.object, public.objectgroup_attributes, public.rulegroup, public.rule_attributes, public.cgreylist, public.domainstat;"
  '
done
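
Afterwards it is worth confirming the new indexes exist and are being used (sketch):
Bash:
# idx_scan should start climbing once the planner picks up the new indexes
psql -d Proxmox_ruledb -c "SELECT indexrelname, idx_scan FROM pg_stat_user_indexes ORDER BY idx_scan DESC LIMIT 20;"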

6. Increasing the open-file limit to avoid "Broken pipe" errors on all nodes
Bash:
HOSTS="mx1 mx2 mx3 mx4 mx5 mx6 mx7 mx8 mx9 mx10 mx11 mx12 mx13 mx14 mx15 mx16 mx17"

for h in $HOSTS; do
  # use a drop-in so the stock /etc/systemd/system.conf is not overwritten
  ssh $h 'mkdir -p /etc/systemd/system.conf.d &&
          printf "[Manager]\nDefaultLimitNOFILE=16384\n" > /etc/systemd/system.conf.d/limits.conf &&
          systemctl daemon-reexec &&
          systemctl daemon-reload &&
          systemctl restart pmgdaemon pmgproxy'
done

Check
Bash:
for h in $HOSTS; do
  echo "== $h ==";
  # pidof -s returns a single PID even when several processes share the name
  ssh $h 'grep "open files" /proc/$(pidof -s pmgdaemon)/limits';
done


These are all basic steps that need to be done on EVERY new PMG installation.
 