Problem - Extract text from attachments

Raito00

Hi!
I enabled the new option "Extract text from attachments",
but when I send a test Word .docx with bad words inside, nothing happens.

How can I test whether the "Extract text from attachments" option is running/working?

## my config info ##

@mailgw:~# pmgversion -v
proxmox-mailgateway: 7.3-1 (API: 7.3-3/a3d66da0, running kernel: 5.15.104-1-pve)
pmg-api: 7.3-3
pmg-gui: 3.3-2
pve-kernel-5.15: 7.4-1
pve-kernel-helper: 7.3-8
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.104-1-pve: 5.15.104-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-1-pve: 5.13.19-3
clamav-daemon: 0.103.8+dfsg-0+deb11u1
ifupdown2: 3.1.0-1+pmx3
libarchive-perl: 3.4.0-1
libjs-extjs: 7.0.0-1
libjs-framework7: 4.4.7-1
libproxmox-acme-perl: 1.4.4
libproxmox-acme-plugins: 1.4.4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-http-server-perl: 4.2-1
libxdgmime-perl: 1.0-1
lvm2: 2.03.11-2.1
pmg-docs: 7.3-2
pmg-i18n: 2.12-1
pmg-log-tracker: 2.3.2-1
postgresql-13: 13.9-0+deb11u1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-spamassassin: 4.0.0-2
proxmox-widget-toolkit: 3.6.5
pve-firmware: 3.6-4
pve-xtermjs: 4.16.0-1
zfsutils-linux: 2.1.9-pve1
 
Can you post the log from that email? What 'bad words' did you use?

It still has to match some SpamAssassin rules that would normally match in the email text too.
 
I have custom rules like this:
###
rawbody __PP_LOCAL_UNWANTED_WORDS_16 /\b(crazy bad word 1|crazy2 bad2 word2|crazy3 bad3 word3)\b/i
meta PP_LOCAL_UNWANTED_WORDS_16 __PP_LOCAL_UNWANTED_WORDS_16 >= 1
describe PP_LOCAL_UNWANTED_WORDS_16 Bad word description
score PP_LOCAL_UNWANTED_WORDS_16 7
tflags __PP_LOCAL_UNWANTED_WORDS_16 multiple maxhits=1
###

And this custom rule works perfectly for mail text.
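
One thing that can be checked offline is whether the pattern itself matches the text you expect to come out of the attachment. The following is only a rough sketch: it uses Python's `re` module as a stand-in for SpamAssassin's Perl regexes, the sample strings are made up, and `rule_would_match` is a hypothetical helper name. It also shows that a phrase written with literal spaces does not match when the same words are split across a line break, which could matter if extracted text is wrapped differently from the mail body.

```python
import re

# Pattern copied from the rawbody rule above.
# Python's `re` is used here only as an approximation of Perl regex behaviour.
PATTERN = re.compile(
    r"\b(crazy bad word 1|crazy2 bad2 word2|crazy3 bad3 word3)\b",
    re.IGNORECASE,
)


def rule_would_match(text: str) -> bool:
    """Return True if the pattern matches anywhere in the given text."""
    return PATTERN.search(text) is not None


# Made-up sample strings, not taken from any real mail.
same_line = "Hello, this has Crazy2 Bad2 Word2 inside."
wrapped = "Hello, this has crazy2\nbad2 word2 inside."

print(rule_would_match(same_line))
print(rule_would_match(wrapped))
```

If the wrapped case is a concern, replacing the literal spaces in the pattern with `\s+` would be one way to tolerate line breaks, at the cost of a slightly looser match.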
 
Then please post a snippet from the journal where you get that mail. Any errors there?
 
Code:
Apr 13 16:09:04 mailgw postfix/postscreen[200098]: PASS OLD [209.85.222.52]:33527
Apr 13 16:09:04 mailgw postfix/smtpd[200423]: connect from mail-ua1-f52.google.com[209.85.222.52]
Apr 13 16:09:05 mailgw pmgpolicy[200172]: SPF says pass
Apr 13 16:09:05 mailgw postfix/smtpd[200423]: 8F05D1E0604: client=mail-ua1-f52.google.com[209.85.222.52]
Apr 13 16:09:05 mailgw postfix/cleanup[200109]: 8F05D1E0604: message-id=<CALZri7GcGSS7Tp36_0oBWmcpdJfbKd0ohAX2-euXFG7MLpytZQ@mail.gmail.com>
Apr 13 16:09:05 mailgw postfix/qmgr[825]: 8F05D1E0604: from=<temp_raito00@gmail.com>, size=20700, nrcpt=1 (queue active)
Apr 13 16:09:05 mailgw pmg-smtp-filter[200429]: 2023/04/13-16:09:05 CONNECT TCP Peer: "[127.0.0.1]:53858" Local: "[127.0.0.1]:10024"
Apr 13 16:09:05 mailgw pmg-smtp-filter[200429]: 1E09416437FEF1B0707: new mail message-id=<CALZri7GcGSS7Tp36_0oBWmcpdJfbKd0ohAX2-euXFG7MLpytZQ@mail.gmail.com>
Apr 13 16:09:06 mailgw postfix/postscreen[200098]: CONNECT from [185.189.237.249]:57891 to [10.10.0.109]:25
Apr 13 16:09:06 mailgw pmg-smtp-filter[200429]: 1E09416437FEF1B0707: SA score=0/5 time=0.530 bayes=0.00 autolearn=ham autolearn_force=no hits=BAYES_00(-1.9),DKIM_SIGNED(0.1),DKIM_VALID(-0.1),DKIM_VALID_AU(-0.1),DKIM_VALID_EF(-0.1),DMARC_PASS(-0.1),FREEMAIL_FROM(0.001),HTML_MESSAGE(0.001),RCVD_IN_DNSWL_HI(-0.2),RCVD_IN_MSPIKE_H2(-0.001),SPF_HELO_NONE(0.001),SPF_PASS(-0.3),T_FREEMAIL_DOC_PDF(0.01)
Apr 13 16:09:06 mailgw postfix/smtpd[200124]: connect from localhost.localdomain[127.0.0.1]
Apr 13 16:09:06 mailgw postfix/smtpd[200124]: 4BF1E1E11F0: client=localhost.localdomain[127.0.0.1], orig_client=mail-ua1-f52.google.com[209.85.222.52]
Apr 13 16:09:06 mailgw postfix/cleanup[200189]: 4BF1E1E11F0: message-id=<CALZri7GcGSS7Tp36_0oBWmcpdJfbKd0ohAX2-euXFG7MLpytZQ@mail.gmail.com>
Apr 13 16:09:06 mailgw postfix/qmgr[825]: 4BF1E1E11F0: from=<temp_raito00@gmail.com>, size=21937, nrcpt=1 (queue active)
Apr 13 16:09:06 mailgw postfix/smtpd[200124]: disconnect from localhost.localdomain[127.0.0.1] ehlo=1 xforward=1 mail=1 rcpt=1 data=1 commands=5
Apr 13 16:09:06 mailgw pmg-smtp-filter[200429]: 1E09416437FEF1B0707: accept mail to <raito00@raito00.gg> (4BF1E1E11F0) (rule: default-accept)
Apr 13 16:09:06 mailgw pmg-smtp-filter[200429]: 1E09416437FEF1B0707: processing time: 0.634 seconds (0.53, 0.03, 0)
Apr 13 16:09:06 mailgw postfix/lmtp[200192]: 8F05D1E0604: to=<raito00@raito00.gg>, relay=127.0.0.1[127.0.0.1]:10024, delay=0.82, delays=0.14/0/0.04/0.64, dsn=2.5.0, status=sent (250 2.5.0 OK (1E09416437FEF1B0707))
Apr 13 16:09:06 mailgw postfix/qmgr[825]: 8F05D1E0604: removed
Apr 13 16:09:06 mailgw postfix/smtp[200200]: 4BF1E1E11F0: to=<raito00@raito00.gg>, relay=10.10.0.102[10.10.0.102]:25, delay=0.36, delays=0.05/0/0.06/0.25, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 4Py0Jy4fmlz1c0Zk)
Apr 13 16:09:06 mailgw postfix/qmgr[825]: 4BF1E1E11F0: removed
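
To see at a glance whether the custom rule fired, the `hits=` list on the `SA score=` line can be split into rule names and scores. This is a minimal sketch that assumes the line format shown in the log above; `parse_sa_hits` is a hypothetical helper, not part of PMG.

```python
import re
from typing import Dict

# The hits list from the "SA score=" line in the log above.
SA_LINE = (
    "SA score=0/5 time=0.530 bayes=0.00 autolearn=ham autolearn_force=no "
    "hits=BAYES_00(-1.9),DKIM_SIGNED(0.1),DKIM_VALID(-0.1),DKIM_VALID_AU(-0.1),"
    "DKIM_VALID_EF(-0.1),DMARC_PASS(-0.1),FREEMAIL_FROM(0.001),HTML_MESSAGE(0.001),"
    "RCVD_IN_DNSWL_HI(-0.2),RCVD_IN_MSPIKE_H2(-0.001),SPF_HELO_NONE(0.001),"
    "SPF_PASS(-0.3),T_FREEMAIL_DOC_PDF(0.01)"
)


def parse_sa_hits(line: str) -> Dict[str, float]:
    """Map each rule name in the hits= field to its score."""
    field = re.search(r"hits=(\S+)", line)
    if field is None:
        return {}
    pairs = re.findall(r"([A-Za-z0-9_]+)\((-?\d+(?:\.\d+)?)\)", field.group(1))
    return {name: float(score) for name, score in pairs}


hits = parse_sa_hits(SA_LINE)
# The custom rule from the earlier post is not among the hits for this mail.
print("PP_LOCAL_UNWANTED_WORDS_16" in hits)
```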
 
In the log I found another error, about a PDF:
Apr 13 17:34:37 mailgw pmg-smtp-filter[211415]: WARNING: extracttext: error (99) from /usr/bin/pdftotext: Syntax Error: Invalid XRef entry 3
 
I thought it was a docx?
Is that log from the tracking center or journalctl/syslog? (Please post from the syslog/journal.)

The warning seems to say that the PDF is probably malformed and can't be parsed, so it cannot detect the text.
 
When I send a test .docx, nothing shows up in the syslog ...

But I looked at mail from other senders,
and there I found these errors:
Apr 13 16:32:03 mailgw pmg-smtp-filter[201124]: WARNING: extracttext: error (2) from /usr/bin/docx2txt: Failed to extract required information from </tmp/.spamassassin201124ouN1xTtmp>!
#
Apr 13 10:19:31 mailgw pmg-smtp-filter[15057]: WARNING: extracttext: error (1) from /usr/bin/tesseract: Premature end of JPEG file
#
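
To get an overview of which extraction tools are failing and how often, the `extracttext` warnings can be pulled out of saved journal output and grouped by tool. This is a rough sketch that assumes the warning format seen in the lines quoted in this thread; `summarize_extracttext_errors` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Warning lines quoted in this thread.
SAMPLE_LINES = [
    "Apr 13 17:34:37 mailgw pmg-smtp-filter[211415]: WARNING: extracttext: "
    "error (99) from /usr/bin/pdftotext: Syntax Error: Invalid XRef entry 3",
    "Apr 13 16:32:03 mailgw pmg-smtp-filter[201124]: WARNING: extracttext: "
    "error (2) from /usr/bin/docx2txt: Failed to extract required information "
    "from </tmp/.spamassassin201124ouN1xTtmp>!",
    "Apr 13 10:19:31 mailgw pmg-smtp-filter[15057]: WARNING: extracttext: "
    "error (1) from /usr/bin/tesseract: Premature end of JPEG file",
]

WARNING_RE = re.compile(r"WARNING: extracttext: error \((\d+)\) from (\S+): (.*)")


def summarize_extracttext_errors(lines: Iterable[str]) -> Counter:
    """Count extracttext warnings per external tool path."""
    counts: Counter = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


for tool, count in summarize_extracttext_errors(SAMPLE_LINES).items():
    print(tool, count)
```

The same function could be fed the lines of a file saved from the journal instead of `SAMPLE_LINES`.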
 
Well, I guess those specific documents and images are simply not properly parseable/scannable by the tools in use, and I don't think we can really do anything about that from our side.
You could try to report the errors, together with the faulty documents, to the respective projects (docx2txt/tesseract/pdftotext/etc.).
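
Before reporting a document upstream, it may help to check whether the .docx is structurally sound at all. A .docx file is a ZIP container, and the main text normally lives in the member `word/document.xml`; the sketch below assumes that a converter like docx2txt needs at least that much to succeed. `docx_problem` is a hypothetical helper, and passing this check does not guarantee that docx2txt will accept the file.

```python
import io
import zipfile
from typing import Optional


def docx_problem(data: bytes) -> Optional[str]:
    """Return a short problem description, or None if the basic structure looks fine."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            if "word/document.xml" not in archive.namelist():
                return "missing word/document.xml"
            bad_member = archive.testzip()  # name of first corrupt member, or None
            if bad_member is not None:
                return "corrupt member: " + bad_member
    except zipfile.BadZipFile:
        return "not a valid ZIP container"
    return None


def make_zip(members: dict) -> bytes:
    """Build a small in-memory ZIP, used here only to exercise the check."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


print(docx_problem(make_zip({"word/document.xml": "<w:document/>"})))
print(docx_problem(b"this is not a docx"))
```

For a real attachment, the bytes would come from reading the saved file, for example `open(path, "rb").read()`.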
 
On a hunch - how large are the files you're trying to scan - and what is your Max Spam Size setting (GUI->Configuration->Spam Detector->Options)?
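
As a quick way to answer the size question, the message size that postfix logs on the `qmgr` line can be compared against the configured limit. This is only a sketch: the limit below is a placeholder value, not necessarily what is set on this system, so take the real number from the Spam Detector options; `message_size` and `exceeds_limit` are hypothetical helpers.

```python
import re
from typing import Optional

# qmgr line from the journal snippet posted earlier in this thread.
QMGR_LINE = (
    "Apr 13 16:09:05 mailgw postfix/qmgr[825]: 8F05D1E0604: "
    "from=<temp_raito00@gmail.com>, size=20700, nrcpt=1 (queue active)"
)

# Placeholder only: replace with the Max Spam Size value shown in the GUI.
ASSUMED_MAX_SPAM_SIZE = 262144


def message_size(line: str) -> Optional[int]:
    """Extract the size= value (in bytes) from a postfix qmgr log line."""
    match = re.search(r"\bsize=(\d+)", line)
    return int(match.group(1)) if match else None


def exceeds_limit(line: str, limit: int) -> bool:
    """Return True if the logged message size is larger than the limit."""
    size = message_size(line)
    return size is not None and size > limit


print(message_size(QMGR_LINE))
print(exceeds_limit(QMGR_LINE, ASSUMED_MAX_SPAM_SIZE))
```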
 
