Tracker Timeout 90% of the time

Sep 17, 2020
Hello All,

90% of the time, when I go to the Tracking Center and want to get a quick look at how many messages came in, in some cases from a certain sender (to ensure a new rule allowed mail to flow properly), I get a timeout. Here is one example:
command '/usr/bin/pmg-log-tracker -v -l 2000 -s 1627352460 -e 1627358400 -n -g -f anything@something.com' failed: got timeout (500)

Sometimes I also search for both the sender and recipient, but I still get timeouts.
I use the default settings, which appear to search roughly the past 24 hours.

For reference, maybe 10-15K messages come in during the day.
The system is a VM running under VMware with 1 processor and 4 GB of RAM.
The processor looks good (screenshot attached).
I did see memory reaching around 80%, so I bumped it up to 6 GB.
You can see in the second image that each bump in memory still peaked at the same point (at 4, 5, and 6 GB).

Is the issue I'm having more about memory or CPU, or maybe something else?

I do look at the box during the day with top; it runs fine, nothing peaking too high, memory looking OK, disk OK.

FYI, another test: I did a fresh reboot, and within one minute of the reboot, with all system resources looking great, the same search in the Tracking Center still timed out.

Any suggestions or ideas appreciated!

Thanks!
 

Attachments

  • 26.07.2021_23.48.02_REC.png (63.7 KB)
  • 27.07.2021_00.23.34_REC.png (40.3 KB)
How long does it take you to execute the given command on the CLI?

What is probably limiting is the storage: the log-tracker must read the mail logs from disk, and if the log is big and the disk is slow, this can take a while.
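
For example, something along these lines (just reusing the parameters from the error message above; adjust the epoch range and address to your own search) would show the raw runtime:
Bash:
# run the same query the GUI issues, wrapped in `time` to see how long the log scan takes
time /usr/bin/pmg-log-tracker -v -l 2000 -s 1627352460 -e 1627358400 -n -g -f anything@something.com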
 
Looks like 13 seconds, but I don't get any results, searching ONLY by the From (sender) field. (A few runs may have taken slightly longer across my multiple attempts, but nothing serious like 30 seconds.)

Code:
root@mgw: Tue Jul 27 10:26:47pm
~/ /usr/bin/pmg-log-tracker -v -l 2000 -s 1627435260 -e 1627444800 -n -g -f user@domain.com
# LogReader: 20532
# Query options
# Sender: noreply@matador.ai
# Start: 2021-07-27 21:21:00 (1627420860)
# End: 2021-07-28 00:00:00 (1627430400)
# End Query Options
root@mgw: Tue Jul 27 10:32:44pm

While I have seen it take longer, there were zero results for messages I know went through the server from different users.
 
I only saw that there was already an answer once I had written the reply below, but I figured it doesn't hurt to still send it.
For reference maybe 10-15K messages come in during the day.
A bit of traffic, but not that much.

System is Virtual running under vmware, 1 processor 4GB.
A single core is not really ideal; the tracker may not get a chance to run if postfix/clamav is currently processing mail.
In general, I try to assign at least two cores to avoid issues here, but actually...

Processor looks good (photo attached)
...IO wait spikes quite a bit though, which could mean the disk serving the log files is the bottleneck.
A second manual attempt could be faster, as some data would then already be cached.

More memory can help a bit to keep the page cache for file access "warm", though that is only useful after at least one request to the tracker has been made.
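
If you want to see how much of the log is actually sitting in the page cache, one option (just an optional sketch; vmtouch is an extra package, not part of PMG, and the path assumes the mail log is at /var/log/mail.log) is:
Bash:
# install the small vmtouch utility, then report how many pages of the log are resident in RAM
apt install vmtouch
vmtouch -v /var/log/mail.log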

Could you post the output of
Bash:
head -n -0 /proc/pressure/*; \
time pmg-log-tracker -v -l 2000 -s 1627435260 -e 1627444800 -n -g -f user@domain.com; \
head -n -0 /proc/pressure/*

I.e., check the pressure stall info immediately before and then again after running a manual search; please adapt the parameters of the search so that they make sense for your setup.
 
@t.lamprecht
Thanks for your response and I apologize for the delay.

In this case the UI came back with a timeout, but the command line did return some data.
This was done immediately after the failure and below are the results.

On a side note, your points about CPU & memory will be reviewed, and we will add a CPU & play with memory as we did before.



BEFORE:
==> /proc/pressure/cpu <==
some avg10=2.17 avg60=6.89 avg300=7.75 total=118260296184

==> /proc/pressure/io <==
some avg10=1.28 avg60=16.43 avg300=7.41 total=12904699364
full avg10=0.97 avg60=13.15 avg300=5.73 total=5485536979

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=205894479
full avg10=0.00 avg60=0.00 avg300=0.00 total=65412576



AFTER:
==> /proc/pressure/cpu <==
some avg10=19.00 avg60=8.22 avg300=7.78 total=118265019669

==> /proc/pressure/io <==
some avg10=93.35 avg60=56.34 avg300=19.37 total=12946567479
full avg10=53.31 avg60=43.16 avg300=15.20 total=5518273827

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.01 avg300=0.00 total=205929029
full avg10=0.00 avg60=0.01 avg300=0.00 total=65441455
 
From the output you posted we can derive that memory is not directly the limiting factor, as no process had to wait for memory availability. IO bandwidth/latency, on the other hand, made lots of processes stall when they could otherwise have run. The CPU is also already congested a bit before the command was run, and quite a bit more afterwards. I.e., all those some/full avg10, avg60 and avg300 numbers should be < 1; anything else suggests a (temporarily) overloaded system.
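
If it helps to see that threshold in context, one simple way (just a sketch, using the same pressure files as above) is to watch the values live while a search is running:
Bash:
# refresh the CPU/IO/memory pressure numbers every second; sustained avg10 values
# well above 1 while the tracker runs point at the congested resource
watch -n 1 'head -n -0 /proc/pressure/*'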

So, to solve this bottleneck you probably want to see if you can assign at least one additional core to that VM, which should be relatively easy, and ideally investigate whether storage access performance can be increased. The latter will surely have a bigger impact, but it is normally also harder to implement than increasing the VM's CPU count from 1 to 2.
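
To get a rough idea of what the storage can deliver for this workload, a simple sequential-read test of the log itself could look like this (a sketch, assuming the mail log is at /var/log/mail.log; the page cache is dropped first so the disk is measured rather than RAM):
Bash:
# flush the page cache, then read the mail log once and report throughput
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/var/log/mail.log of=/dev/null bs=1M status=progress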
 
Reopening this, as it fails more often than not.
I upgraded the VM to 2 CPUs and 9 GB of RAM, but I'm still getting the timeout.
Could we maybe have it wait a little longer before timing out?
 
I upgraded the VM to 2 CPUs and 9 GB of RAM, but I'm still getting the timeout.
As mentioned last time, memory wasn't the main limiting factor at all; it was partly CPU (which should be better now) and mainly IO bandwidth/latency.

Can you run the test from my post above again, and this time also include the runtime of pmg-log-tracker (i.e., what the prefixed time outputs)?
From that we can see what the bottleneck is now, and the actual time pmg-log-tracker needs to run.
 
Thanks for the quick response.
I have run it twice with two users, but if this searches for the user as the sender, maybe we should search by recipient instead, since they don't relay out through the MGW.


Code:
head -n -0 /proc/pressure/*; \
time pmg-log-tracker -v -l 2000 -s 1627435260 -e 1627444800 -n -g -f user1@user1.com; \
head -n -0 /proc/pressure/*
==> /proc/pressure/cpu <==
some avg10=19.96 avg60=32.97 avg300=17.16 total=113408907639
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

==> /proc/pressure/io <==
some avg10=0.75 avg60=0.79 avg300=0.28 total=15863457958
full avg10=0.01 avg60=0.03 avg300=0.00 total=12472469242

==> /proc/pressure/memory <==
some avg10=0.14 avg60=0.11 avg300=0.02 total=1349875489
full avg10=0.00 avg60=0.00 avg300=0.00 total=957879593
# LogReader: 2520775
# Query options
# Sender: user1@user1.com
# Start: 2021-07-27 21:21:00 (1627420860)
# End: 2021-07-28 00:00:00 (1627430400)
# End Query Options


real    0m0.136s
user    0m0.006s
sys     0m0.004s
==> /proc/pressure/cpu <==
some avg10=19.96 avg60=32.97 avg300=17.16 total=113408908210
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

==> /proc/pressure/io <==
some avg10=0.75 avg60=0.79 avg300=0.28 total=15863580565
full avg10=0.01 avg60=0.03 avg300=0.00 total=12472591289

==> /proc/pressure/memory <==
some avg10=0.14 avg60=0.11 avg300=0.02 total=1349875489
full avg10=0.00 avg60=0.00 avg300=0.00 total=957879593
root@mgw Tue Jun 20 11:31:56am
~/ head -n -0 /proc/pressure/*; time pmg-log-tracker -v -l 2000 -s 1627435260 -e 1627444800 -n -g -f user2@user2.com; head -n -0 /proc/pressure/*
==> /proc/pressure/cpu <==
some avg10=6.86 avg60=25.66 avg300=16.41 total=113410081041
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

==> /proc/pressure/io <==
some avg10=0.49 avg60=0.76 avg300=0.31 total=15863650113
full avg10=0.36 avg60=0.20 avg300=0.04 total=12472649543

==> /proc/pressure/memory <==
some avg10=0.02 avg60=0.08 avg300=0.02 total=1349875501
full avg10=0.00 avg60=0.00 avg300=0.00 total=957879605
# LogReader: 2520810
# Query options
# Sender: user2@user2.com
# Start: 2021-07-27 21:21:00 (1627420860)
# End: 2021-07-28 00:00:00 (1627430400)
# End Query Options


real    0m0.004s
user    0m0.000s
sys     0m0.004s
==> /proc/pressure/cpu <==
some avg10=6.86 avg60=25.66 avg300=16.41 total=113410087907
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

==> /proc/pressure/io <==
some avg10=0.49 avg60=0.76 avg300=0.31 total=15863650113
full avg10=0.36 avg60=0.20 avg300=0.04 total=12472649543

==> /proc/pressure/memory <==
some avg10=0.02 avg60=0.08 avg300=0.02 total=1349875501
full avg10=0.00 avg60=0.00 avg300=0.00 total=957879605
 
Hello All,

I really need to get this resolved.
I want to ask some questions about how this works and see if it will help me better understand the issue.
Are we reading these details from a DB or are we actually querying the log file and trying to gather the data?

What I can share is:
  • Sometimes, when I query from the command line, I will get a result.
  • Sometimes, after multiple failed searches from the GUI, or combining the GUI and the command line (which also fails), it will eventually come back with a response.
  • Adding memory, adding another virtual CPU, changing to SSD vs. a nice RAID array: no difference.
  • Running the latest PMG 8.x.

The above is why I asked the question about a DB.

The last piece of information: today (the 26th day of the month) I have 11+ million lines in my mail.log.


Hopefully someone can help me figure this out so that it becomes easier to get answers when we are tracking.

Thanks!
 
@t.lamprecht
I'm still having the problem, and it is becoming impossible to get information in a reasonable amount of time.
If you search, wait a while, search again, wait a while, maybe 20 minutes later you can get the information; that's simply not acceptable.
I take into consideration what day of the month the search was done; this was on the 5th, so the log file isn't that big. Today, the 10th, the mail.log is 4 GB+, so my question is: would there not be value in putting the log into a DB?



 
If you search, wait a while, search again, wait a while, maybe 20 minutes later you can get the information; that's simply not acceptable.
Again, the CPU of that host is totally starved:
some avg10=19.96 avg60=32.97 avg300=17.16 total=113408908210
On average over the last five minutes, processes that wanted to run were stalled waiting for CPU time about 17% of the time.

As long as this host is so underpowered, it just won't get faster...
 
Thank you for the update. I changed the VM from 2 to 4 CPUs and will research further.

My question relates to the actual log file.
Is this function not searching the log file on the fly?
My feeling from working with the system is that if I search the log file in Linux, then search in the tracker either right after or even a short while later, I get results quickly; but if I haven't searched in the tracker since the day before, it simply times out.
While YES, I agree the CPU can get hit pretty hard doing a grep, per se, isn't this part of the issue?
I say this because the mail.log file is hitting a million lines or more as we get towards the end of the month.
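
As a crude workaround (just a sketch, not a fix for the underlying IO bottleneck, and assuming there is enough free RAM to hold the file), the log could be read once before using the Tracking Center so the subsequent tracker run hits the page cache instead of the disk:
Bash:
# warm the page cache by reading the current mail log once
cat /var/log/mail.log > /dev/null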
 
I added the additional CPUs, making it 4, and initially, since searching on the command line more or less caches the log, I was able to search very quickly.
Today, when I started searching for a different email address issue, it timed out, multiple times.
The current mail.log file (month-to-date rotation) is 9,031,756 lines.
This is why I keep asking whether this should not be put directly into a DB, which would provide instant access to all the information.
 
