SPF temperror tipping legitimate mail into quarantine — what worked

Fra

We chased a wave of false-positive spam quarantines on PMG 8.2.11 down to this pattern.

Sharing in case it saves someone a week. Even better, maybe you spot a rabbit hole we missed.

(Summary by Claude-Code: I believe using claude in this forum is not an issue.)

Symptom: legitimate mail held in spam quarantine. Headers show:

Code:
Received-SPF: temperror (...: Time-out on DNS 'TXT' lookup of '...')

and KAM_DMARC_REJECT (+7) firing. With a quarantine threshold of 5 this alone tips legit mail over.

Root cause is two layers, not one:
1. Local unbound in default recursive mode: cold-cache TXT lookups take 300-1500 ms, and full SPF chains for popular senders take up to 16 s. That exceeds SA's spf_timeout of 5 s, so the result is temperror. Fixable.

2. Remote authoritative DNS that simply doesn't answer: sendgrid.net, ab.sendgrid.net, eu.mailgun.org, enotice.ieee.org, accenture.com, plus a long tail of mass-mailer and questionable senders. Unfixable from our side.
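
To see why the local layer alone can blow the budget, here is a rough back-of-the-envelope sketch. It assumes the lookups in an SPF chain run one after another and that spf_timeout covers the whole SPF query. The function names and sample latencies are illustrative, not measurements from this box. The count of 10 is the cap RFC 7208 puts on DNS-querying terms in a single SPF check.

```python
def spf_chain_seconds(lookup_ms):
    """Total wall time if every lookup in the chain runs sequentially."""
    return sum(lookup_ms) / 1000.0


def spf_result(lookup_ms, spf_timeout_s=5.0):
    """Return 'temperror' when the chain outlasts the timeout, else 'evaluated'."""
    if spf_chain_seconds(lookup_ms) > spf_timeout_s:
        return "temperror"
    return "evaluated"


cold_chain = [1500] * 10  # assumed worst case: 10 cold lookups at 1500 ms each
warm_chain = [50] * 10    # assumed warm cache: 10 lookups at 50 ms each

print(spf_chain_seconds(cold_chain), spf_result(cold_chain))
print(spf_chain_seconds(warm_chain), spf_result(warm_chain))
```

Under these assumptions the cold chain needs 15 s and times out, while the warm chain finishes in half a second. A remote server that never answers is a different case: no cache tuning changes its latency.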


Three layered fixes:

(1) Tune local unbound — /etc/unbound/unbound.conf.d/pmg-tuning.conf:
Code:
  server:
      msg-cache-size: 64m
      rrset-cache-size: 128m
      neg-cache-size: 16m
      prefetch: yes
      prefetch-key: yes
      harden-glue: yes
      harden-dnssec-stripped: yes
      harden-referral-path: yes
      num-threads: 2
Then run systemctl reload unbound. Repeat lookups drop to <50 ms once a record is cached, and prefetch refreshes frequently queried sender records before they expire. Do NOT forward unbound to 1.1.1.1 / 8.8.8.8 / 9.9.9.9: the major DNSBLs (Spamhaus zen + DBL, URIBL, SURBL, DNSWL) still return "blocked public resolver" sentinels for shared-upstream queries, silently breaking your spam filter in both directions at once. Pure recursive has been the right stance per the PMG wiki since at least 2022.
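
One way to tell whether a resolver path is being refused by a DNSBL is to look at the return code itself. The sketch below classifies A-record answers from a Spamhaus zone. The three codes in 127.255.255.x are the error responses Spamhaus documents; the function names and the manual check are my own illustration, not part of PMG. Other lists use their own codes, so treat the table as a starting point only.

```python
import socket

# Error return codes documented by Spamhaus. These are not listings.
SPAMHAUS_ERROR_CODES = {
    "127.255.255.252": "typing error in DNSBL name",
    "127.255.255.254": "query came via a public/open resolver",
    "127.255.255.255": "excessive number of queries",
}


def classify_dnsbl_answer(answer):
    """Classify one A-record answer from a Spamhaus zone.

    answer is a dotted-quad string, or None when no record came back.
    """
    if answer is None:
        return "not listed"
    if answer in SPAMHAUS_ERROR_CODES:
        return "error: " + SPAMHAUS_ERROR_CODES[answer]
    if answer.startswith("127.0."):
        return "listed"
    return "unexpected answer"


def query_zen(ip):
    """Look up an IPv4 address in zen.spamhaus.org via the system resolver.

    Returns None on NXDOMAIN. In this sketch, other resolver failures
    also end up as None, so a None here is not proof of a clean IP.
    """
    reversed_ip = ".".join(reversed(ip.split(".")))
    try:
        return socket.gethostbyname(reversed_ip + ".zen.spamhaus.org")
    except socket.gaierror:
        return None
```

Running classify_dnsbl_answer(query_zen("127.0.0.2")) on the PMG host is a reasonable smoke test: 127.0.0.2 is the conventional DNSBL test address, so a working path should report it as listed, while a forwarded path is more likely to show one of the error codes.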


(2) Raise spf_timeout from 5 s to 15 s — /etc/mail/spamassassin/local.cf:
Code:
  spf_timeout 15
Then run systemctl restart pmg-smtp-filter. The default of 5 s is too aggressive for chains with slow includes. Anything still failing at 15 s is a dead remote authoritative server, and no higher timeout will help.

(3) Surgical SA meta-rule — same file, append:
Code:
  meta KAM_TEMPERROR_RESCUE  (KAM_DMARC_REJECT && T_SPF_TEMPERROR)
  score KAM_TEMPERROR_RESCUE -4.0
  describe KAM_TEMPERROR_RESCUE  Rescue when KAM_DMARC_REJECT was due to SPF temperror, not a real fail
Real spoof (SPF fail + DMARC reject) still scores +7 → quarantined. Temperror + DMARC reject scores +7 − 4 = +3 → delivered. Preserves anti-spoof
protection where it matters; compensates only where SPF couldn't actually be evaluated.
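
The arithmetic can be checked with a small sketch. The scores and the threshold of 5 are the ones quoted in this post; the function names are made up for illustration, and every other rule that might fire on a real message is ignored.

```python
QUARANTINE_THRESHOLD = 5.0  # quarantine threshold quoted in the post

SCORES = {
    "KAM_DMARC_REJECT": 7.0,       # score quoted in the post
    "KAM_TEMPERROR_RESCUE": -4.0,  # score of the meta rule above
}


def total_score(dmarc_reject, spf_temperror):
    """Sum only the two rules discussed here; other rules are ignored."""
    score = 0.0
    if dmarc_reject:
        score += SCORES["KAM_DMARC_REJECT"]
    # The meta rule fires only when both conditions hold.
    if dmarc_reject and spf_temperror:
        score += SCORES["KAM_TEMPERROR_RESCUE"]
    return score


def verdict(score):
    """Map a score to the outcome at the configured threshold."""
    return "quarantined" if score >= QUARANTINE_THRESHOLD else "delivered"


for reject, temperror in [(True, False), (True, True)]:
    s = total_score(reject, temperror)
    print(reject, temperror, s, verdict(s))
```

The rescued case lands at 3.0, which leaves 2 points of headroom before the threshold. A temperror message that also trips other spam rules can still be quarantined.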

Honest observation from 90 hours of measurement: (1) is necessary but doesn't visibly reduce the temperror quarantine count on its own (the DNS layer
heals — synthetic probe p95 from ~1300 ms to ~0 ms — but the remote-DNS residual stays). (2) helps a thin slice. (3) is the actual move-the-needle
change.

PMG 8.2.11 on Debian 12, KAM ruleset installed. Should apply unchanged to PMG 9.
 
I just reject them temporarily with my policyguard.
If the error is a temp error, the mail will arrive approx. 5 min later.
If it was spam (and the sender DNS is misconfigured), there is usually no second try.
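
For readers who want to try the same idea, here is a minimal sketch of the decision only. I don't know how the policyguard mentioned above is built. This assumes a Postfix policy-delegation service, where the reply is an action= line followed by an empty line. The function name and the deferral text are hypothetical.

```python
def policy_response(spf_result):
    """Build a Postfix policy-delegation reply for a given SPF result.

    spf_result is a lowercase SPF result string such as 'pass' or 'temperror'.
    """
    if spf_result == "temperror":
        # 4xx deferral: a legitimate sending MTA queues the mail and retries.
        action = "DEFER_IF_PERMIT SPF lookup failed temporarily, please retry"
    else:
        # No opinion: later restrictions and the content filter decide.
        action = "DUNNO"
    # A policy reply is a set of attribute lines ended by an empty line.
    return "action=" + action + "\n\n"


print(policy_response("temperror"), end="")
print(policy_response("pass"), end="")
```

The trade-off is delay: mail from senders whose DNS stays dead keeps getting deferred until the sending side gives up, so this fits best alongside the timeout and cache tuning rather than instead of it.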