This investigation was surprising to me so I thought it would be interesting to share my findings and I hope you'll like it.
Some of my clients occasionally reported that the updown confirmation email (used to confirm a new email address, provided by Devise) had been classified as spam. We're talking about this one:
It doesn't look too spammy so far, but some mail servers running SpamAssassin were indeed reporting a "Spam-Score" above 5, which is SpamAssassin's default threshold for considering an email as spam. If we have access to the raw email with headers, this is often easy to see (real example provided by one client):
X-Spam-Report:
* 0.0 HTML_MESSAGE BODY: Nachricht =?UTF-8?Q?enth=E4lt?= HTML
* 2.8 HTML_IMAGE_ONLY_28 BODY: HTML: images with 2400-2800 bytes of words
* -0.0 T_SCC_BODY_TEXT_LINE No description available.
* 4.0 URI_PHISH Phishing using web form
X-Spam-Score: 6.8
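To make the arithmetic behind that header explicit, here is a minimal Ruby sketch that parses the report lines and sums the points. The helper name `parse_spam_report` is made up for this illustration, and it assumes each report line looks like `* <points> <RULE_NAME> ...` as in the example above.

```ruby
# Report lines copied from the client's headers shown above.
REPORT = <<~HEADER
  * 0.0 HTML_MESSAGE BODY: Nachricht =?UTF-8?Q?enth=E4lt?= HTML
  * 2.8 HTML_IMAGE_ONLY_28 BODY: HTML: images with 2400-2800 bytes of words
  * -0.0 T_SCC_BODY_TEXT_LINE No description available.
  * 4.0 URI_PHISH Phishing using web form
HEADER

REQUIRED_SCORE = 5.0 # SpamAssassin's default threshold

# Hypothetical helper: returns { "RULE_NAME" => points } for each "* pts RULE" line.
def parse_spam_report(report)
  report.scan(/^\*\s+(-?\d+(?:\.\d+)?)\s+(\w+)/).to_h { |pts, rule| [rule, pts.to_f] }
end

scores = parse_spam_report(REPORT)
total = scores.values.sum.round(1)

puts "Total: #{total} (spam: #{total >= REQUIRED_SCORE})"
# List the rules that contribute at least one full point.
scores.select { |_, pts| pts >= 1.0 }.each { |rule, pts| puts "  #{pts} #{rule}" }
```

Two rules alone (HTML_IMAGE_ONLY_28 and URI_PHISH) account for the whole 6.8, which is why the rest of the investigation focuses on them.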
So I started investigating why SpamAssassin was applying these rules to this email and oh boy, I wasn't ready for what I found.
I first tried reproducing the problem locally by installing SpamAssassin and running some checks on the exact same email from that client (example instructions used on Ubuntu 22.04):
> sudo apt install spamassassin
> spamassassin -V
SpamAssassin version 3.4.6
running on Perl version 5.34.0
> spamassassin -t < confirmation-instructions.eml
# ...
Content analysis details: (0.6 points, 5.0 required)
pts rule name description
---- ---------------------- --------------------------------------------------
-1.0 RCVD_IN_MSPIKE_H5 RBL: Excellent reputation (+5)
[104.245.209.212 listed in wl.mailspike.net]
-0.0 SPF_HELO_PASS SPF: HELO matches SPF record
0.7 HTML_IMAGE_ONLY_28 BODY: HTML: images with 2400-2800 bytes of
words
0.0 HTML_MESSAGE BODY: HTML included in message
-0.1 DKIM_VALID Message has at least one valid DKIM or DK signature
-0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from
author's domain
0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily
valid
-0.0 RCVD_IN_MSPIKE_WL Mailspike good senders
1.0 URI_PHISH Phishing using web form
Disappointingly, the result was very different and the score very low. We could still see the same impacting rules though (HTML_IMAGE_ONLY_28 and URI_PHISH), but with lower scores.
I also tried the -Lt options, which mean "local-only test" (no calls to remote servers, online blacklists, etc.). In that case there are fewer tests, as expected, but the score of others increases:
> spamassassin -Lt < confirmation-instructions.eml
# ...
Content analysis details: (3.8 points, 5.0 required)
pts rule name description
---- ---------------------- --------------------------------------------------
0.0 HTML_MESSAGE BODY: HTML included in message
2.8 HTML_IMAGE_ONLY_28 BODY: HTML: images with 2400-2800 bytes of
words
1.0 URI_PHISH Phishing using web form
This is likely to make up for the fact that there's less signal available, so the remaining signals get amplified in order to reach the spam threshold of 5 earlier, I guess.
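To see exactly which scores move between the two runs, the tables can be compared programmatically. This is a small sketch using rows copied from the two outputs above; `rule_scores` is a hypothetical helper, and it assumes every rule row starts with a decimal score followed by an upper-case rule name.

```ruby
# Hypothetical helper: returns { "RULE_NAME" => points } from report table rows.
# Wrapped description lines and bracketed details do not match and are ignored.
def rule_scores(report)
  report.scan(/^\s*(-?\d+\.\d+)\s+([A-Z][A-Z0-9_]+)\b/).to_h { |pts, rule| [rule, pts.to_f] }
end

# Rows copied from the network-enabled run.
network = rule_scores(<<~REPORT)
  -1.0 RCVD_IN_MSPIKE_H5 RBL: Excellent reputation (+5)
  [104.245.209.212 listed in wl.mailspike.net]
  0.7 HTML_IMAGE_ONLY_28 BODY: HTML: images with 2400-2800 bytes of
  words
  0.0 HTML_MESSAGE BODY: HTML included in message
  1.0 URI_PHISH Phishing using web form
REPORT

# Rows copied from the local-only run.
local = rule_scores(<<~REPORT)
  0.0 HTML_MESSAGE BODY: HTML included in message
  2.8 HTML_IMAGE_ONLY_28 BODY: HTML: images with 2400-2800 bytes of
  words
  1.0 URI_PHISH Phishing using web form
REPORT

# Keep only the rules present in both runs whose score differs.
changed = (network.keys & local.keys)
  .reject { |rule| network[rule] == local[rule] }
  .to_h { |rule| [rule, [network[rule], local[rule]]] }

changed.each do |rule, (net, loc)|
  puts "#{rule}: #{net} (network) vs #{loc} (local-only)"
end
```

On this particular sample, only HTML_IMAGE_ONLY_28 changes (0.7 vs 2.8, a factor of 4), while URI_PHISH stays at 1.0 in both runs.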
If you have (or can have) a local DNS resolver, I would recommend making sure you enable network rules for more reliable results. If using spampd, this is configured with LOCALONLY=0 in /etc/default/spampd.
So even though the scores were lower, I knew they could be multiplied for various reasons, including configuration, so I'd better see if I could avoid the email being flagged as HTML_IMAGE_ONLY_28 and URI_PHISH entirely to eliminate the problem.
I first wrote a quick way to test the spam score of these emails in my specs (using spamd, the daemon version of SpamAssassin, and spamc, the command-line client), in order to iterate and test changes quickly, but also to avoid regressions caused by future changes to my emails or future versions of SpamAssassin:
require "open3"

# Returns the (normalized) SpamAssassin report for the given email,
# or skips the current spec when SpamAssassin is not available.
def spam_check email
  # Using spamc/spamd (daemon) if available, much faster
  cmd = "spamc --full --connect-retries=1"
  # Using spamassassin (standalone cmd), slower but supports local-only option
  # cmd = "spamassassin -Lt"
  # Inject Received header to trigger more rules like __VIA_ML (return-path contains "bounces@")
  stdin = "Received: by mta212a-ord.mtasv.net id h6qj0s27tk4a for <#{email.to.first}>; #{email.date.rfc2822} (envelope-from <pm_bounces@bounce.updown.io>)\n" + email.to_s
  out, err, status = Open3.capture3(cmd, stdin_data: stdin)
  if out == "0/0\n"
    skip "spamd is not running: `sudo systemctl start spamassassin.service`"
  elsif status.success?
    # minor processing to have stable rules orders and remove width limit
    headers, rules = out.chomp.split("--\n")
    rules.gsub!(/\n\s{5,}/m, " ")
    return headers + "\n" + rules.lines.sort.join
  else
    # Plain RuntimeError: a bare `Error` constant is not defined by Ruby itself
    raise "Command `#{cmd}` exited with status #{status.exitstatus}: #{err}"
  end
rescue Errno::ENOENT => e
  skip "SpamAssassin not installed: #{e.to_s}"
end
require "rails_helper"

describe UserMailer do # Devise inherited mailer
  let(:user) { create :user }
  let(:email) { ActionMailer::Base.deliveries.last }

  describe '#confirmation_instructions' do
    subject { user }

    it "passes spam check" do
      subject
      expect(spam_check(email)).to include(<<~REPORT)
        Content analysis details: (0.0 points, 5.0 required)
        pts rule name description
        ---- ---------------------- ------------------------------------------------
        0.0 HTML_IMAGE_ONLY_32 BODY: HTML: images with 2800-3200 bytes of words
        0.0 HTML_MESSAGE BODY: HTML included in message
        -0.0 NO_RELAYS Informational: message was not relayed via SMTP
      REPORT
    end
  end
end
Now let's have a look at these two rules. It's hard to find clear definitions sometimes but fortunately SpamAssassin is open source so where there is a will there's a way.
HTML_IMAGE_ONLY_28 is the easiest and the most self-explanatory: it simply checks if the email contains an image (it does, the updown.io logo) and if the content is between 2400 and 2800 bytes. So basically, if the email is short and has an image, it's more likely to be spam (this is because some spam emails hide their text in images to avoid filters). There are only two options here:
1. Remove the image
2. Increase the content length
I chose the latter to keep a consistent look and also because of the second rule. In the end I only increased it a bit and now it matches the HTML_IMAGE_ONLY_32 rule. This rule scores 2.2 in local-only testing but 0 (surprisingly) when network tests are enabled. (If we follow the same logic as HTML_IMAGE_ONLY_28, it should have been 2.2/4 ≈ 0.55.)
Getting rid of this rule would require much more text bloat or cheating (invisible text, etc.), and it matches more of my emails, so for the moment I decided to leave it like that and wait for the next problem. 2.2 is not enough on its own to trip the spam threshold (5), and hopefully SpamAssassin will improve this part before I need to hack around it.
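As a rough way to guess which size bucket an email with an image would land in, here is a small Ruby sketch. It rests on assumptions: that buckets are 400 bytes wide and that the numeric suffix is the upper bound in hundreds of bytes (which is what HTML_IMAGE_ONLY_28 and HTML_IMAGE_ONLY_32 suggest), and that the series stops at 32. Both helper names are hypothetical, and the tag stripping is only a crude approximation of how SpamAssassin measures text, so treat the result as an estimate.

```ruby
BUCKET_WIDTH = 400   # assumed width of each size bucket, in bytes
MAX_UPPER    = 3200  # assumed upper bound of the last bucket (HTML_IMAGE_ONLY_32)

# Crude estimate of the visible text size: drop tags, collapse whitespace.
# SpamAssassin's own HTML parser may well count differently.
def approx_text_bytes(html)
  html.gsub(/<[^>]*>/, " ").split.join(" ").bytesize
end

# Hypothetical mapping from a text size to the rule name it would fall under,
# treating each bucket as (lower, upper]. Returns nil outside the assumed range.
def image_only_rule_for(text_bytes)
  upper = ((text_bytes + BUCKET_WIDTH - 1) / BUCKET_WIDTH) * BUCKET_WIDTH
  return nil unless upper.between?(BUCKET_WIDTH, MAX_UPPER)
  format("HTML_IMAGE_ONLY_%02d", upper / 100)
end

puts image_only_rule_for(2600) # => HTML_IMAGE_ONLY_28
puts image_only_rule_for(3000) # => HTML_IMAGE_ONLY_32
```

Running `image_only_rule_for(approx_text_bytes(html_body))` in a spec gives an early hint when a wording change moves an email from one bucket to another.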
Now for the most interesting part: URI_PHISH. After some online searching I first found what seems to be a plugin checking URLs against a blacklist, but it gives the URI_PHISHING rule (not exactly the same) and I didn't install any plugin, so this is not the one.
I then found a very interesting report from 2021 about a similar confirmation email receiving a "false positive" classification as URI_PHISH, and the official answer was:
It's not based on "phishing URLs" or the specific link, it's based on having body text that looks like account phishing and having a URL. The body text that looks suspiciously like phishing is, unsurprisingly, "confirm your account".
As Loren said, this is not a FP, as the total score for the message did not exceed the spam threshold. This is a single-rule hit on spammy-looking content without other signs to support it. That happens.
It is not a bug that a given rule will hit some ham. The only suggestion I can offer is that you reword your message to make it look less like phishing.
So let's skip over the fact that it's very sad that anti-spam filters now have to block any simple confirmation email just because scammers are successfully abusing people with them...
That piqued my curiosity: what exactly are they looking for in the email? How can I make sure that the change I make won't be matched by another rule, now or in the future? (Yes, we unfortunately have to think like scammers now in order to get our regular emails accepted...)
So by searching for URI_PHISH in the code I ended up in this big rules file, which contains this (extract slightly simplified):
meta __URI_PHISH __HAS_ANY_URI && !__URI_GOOGLE_DOC && !__URI_GOOG_STO_HTML && (__EMAIL_PHISH || __ACCT_PHISH)
meta URI_PHISH __URI_PHISH && !ALL_TRUSTED && !__UNSUB_LINK && !__TAG_EXISTS_CENTER && !__HAS_SENDER && !__CAN_HELP && !__VIA_ML && !__UPPERCASE_URI && !__HAS_CC && !__NUMBERS_IN_SUBJ && !__PCT_FOR_YOU && !__MOZILLA_MSGID && !__FB_COST && !__hk_bigmoney && !__REMOTE_IMAGE && !__HELO_HIGHPROFILE && !__RCD_RDNS_SMTP_MESSY && !__BUGGED_IMG && !__FB_TOUR && !__RCVD_DOTGOV_EXT
describe URI_PHISH Phishing using web form
score URI_PHISH 4.00 # limit
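To make the boolean logic of this extract easier to follow, here is a minimal Ruby model of it. It is only an illustration of the expression above, not SpamAssassin's implementation: `uri_phish?` and `uri_phish_core?` are hypothetical names, and `hits` stands for the list of sub-rules that matched a given email.

```ruby
# The "!" terms of the URI_PHISH meta rule, copied from the extract above.
EXCLUSIONS = %w[
  ALL_TRUSTED __UNSUB_LINK __TAG_EXISTS_CENTER __HAS_SENDER __CAN_HELP __VIA_ML
  __UPPERCASE_URI __HAS_CC __NUMBERS_IN_SUBJ __PCT_FOR_YOU __MOZILLA_MSGID
  __FB_COST __hk_bigmoney __REMOTE_IMAGE __HELO_HIGHPROFILE
  __RCD_RDNS_SMTP_MESSY __BUGGED_IMG __FB_TOUR __RCVD_DOTGOV_EXT
].freeze

# Mirrors the __URI_PHISH meta: a URL plus phishing-looking wording.
def uri_phish_core?(hits)
  hits.include?("__HAS_ANY_URI") &&
    !hits.include?("__URI_GOOGLE_DOC") &&
    !hits.include?("__URI_GOOG_STO_HTML") &&
    (hits.include?("__EMAIL_PHISH") || hits.include?("__ACCT_PHISH"))
end

# Mirrors the URI_PHISH meta: the core condition, unless ANY exclusion matched.
def uri_phish?(hits)
  uri_phish_core?(hits) && EXCLUSIONS.none? { |rule| hits.include?(rule) }
end

base = %w[__HAS_ANY_URI __ACCT_PHISH]
puts uri_phish?(base)                    # => true
puts uri_phish?(base + ["__HAS_SENDER"]) # => false
```

The second call shows the important property: a single matching exclusion is enough to cancel the rule, whatever the body text says.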
OK, so we now have an entry point which of course references MANY other rules (some of which also contain other rules). I checked ALL of them for you ^^ and here are my most interesting findings:
- __HAS_ANY_URI → simple regexp on /^\w+:\/\//
- __EMAIL_PHISH || __ACCT_PHISH → these are the sub-rules where the main "phishing" heuristics happen:
  - __WEBMAIL_ACCT, __MAILBOX_FULL, __MAILBOX_FULL_SE, __CLEAN_MAILBOX, __VALIDATE_MAILBOX, __VALIDATE_MBOX_SE, __UPGR_MAILBOX, __LOCK_MAILBOX, __SYSADMIN, __ATTN_MAIL_USER, __MAIL_ACCT_ACCESS1, __MAIL_ACCT_ACCESS2, __ACCESS_REVOKE, __PASSWORD_UPGRADE, __PENDING_MESSAGES, __RELEASE_MESSAGES, __PASSWORD_EXP_CLUMSY → these are all regexps for typical email scams (mailbox full, click here to regain access to your account, etc.), nothing matching in my email.
  - __PDS_FROM_NAME_TO_DOMAIN → ⚠️ this one is interesting: it triggers if the From name is equal to the To domain (for example if the email is From "example.com" To "adrien@example.com"). This is because many scams use that to make it look like the email comes from your "domain administrator". It wasn't the case for me here, but make sure you don't do that.
  - __VERIFY_ACCOUNT → this is the one matching our email, so I had to change the wording to avoid it. The regexp is: /(?:confirm|updated?|verif(?:y|ied)) (?:your|the) (?:(?:account|current|billing|personal|online)? ?(?:records?|information|account|identity|access|data|login)|"?[^\@\s]+\@\S+"? (?:account|mail ?box)|confirm verification|verify k?now|Ihre Angaben .berpr.ft und best.tigt)/i
  - __FAILED_LOGINS, __ACCOUNT_REACTIV, __SECURITY_DEPT, __ACCOUNT_ERROR, __ACCOUNT_DISRUPT, __ACCOUNT_UPGRADE, __ACCOUNT_SECURE, __SUSPICION_LOGIN, __ACCESS_SUSPENDED, __ACCESS_RESTORE, __ACCESS_REVOKE → another set of regexps for classic account scams based on fear. I made sure my "account locked" email does not match any of those.

Then there are the negated rules (prefixed with !), which are meant to exclude content (if this rule is true, then the URI_PHISH rule will NOT apply):

- !__URI_GOOGLE_DOC and !__URI_GOOG_STO_HTML → regexps on docs\.google\.com and storage\.googleapis\.com. They got their own special rule, so they are excluded here.
- !ALL_TRUSTED → this is for when you configure some internal email servers as "trusted", not applicable here.
- !__UNSUB_LINK → ⚠️ also interesting, this one tries to match unsubscribe links with /\b(?:(?:un)?subscri(?:ber?|ptions?)|abuses?|opt(?:ing)?.?out)\b/i. It's good to know that simply having an unsubscribe link could prevent URI_PHISH, but unfortunately for an account confirmation email you can't really "unsubscribe" people: this is not a mailing list or on-boarding email. Otherwise this would have been a good option to improve both the spam score and the user experience.
- !__VIA_ML → this rule checks if the envelope-from/return-path contains "bounces@" to declare this is a "Mailing List". In my case, using Postmark, this is the case and unfortunately cannot be customized (only the domain: pm_bounces@bounce.updown.io). So I guess you should avoid using "bounces@" in your return path addresses for transactional emails if you can...
- !__TAG_EXISTS_CENTER → this rule just checks for the presence of a <center> tag. So if you add one, magically your email is no longer URI_PHISH... WAIT, WHAT? Surely if your email is centered the old way, then it's not phishing (tested locally).
- !__HAS_SENDER → if you add ANY Sender header, the URI_PHISH rule is skipped… The goal of the Sender header is for services sending emails on behalf of other users; it helps with authentication validation. But anybody can put anything in here, so there's no reason to consider an email "less phishing" because it contains this header (tested locally).
- !__CAN_HELP → even simpler, this will skip the rule if the email contains "can help"... (tested locally)
- !__UPPERCASE_URI → pretty self-explanatory.
- !__HAS_CC → what? why?
- !__NUMBERS_IN_SUBJ → OK, so 3 or more consecutive digits in the subject line also help... /\d{3}/
- !__FB_COST → this one checks for the word... "cost". Yep, just that. Put it in an email and suddenly it's not phishing... (tested locally)
- !__FB_TOUR → similarly, this one checks for the word "tour"...

It's likely that some of these rules are only here to replace URI_PHISH with another, more specific one (like we saw with Google Doc URLs), but still, in this state it's pretty easy to exploit them, and in my local testing, using those words to trigger those rules didn't cause other spam rules to appear...
Which means that in the end we have a spam filter which is very easy to fool, yet easily tripped by honest emails...
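Since the regexps are public, they can also be checked directly in a spec before even calling SpamAssassin. Below is a minimal Ruby sketch reusing the two regexps quoted above for __VERIFY_ACCOUNT and __UNSUB_LINK. The helper `phishy_phrases` and the sample sentences are made up for this illustration, and SpamAssassin applies its rules to its own rendering of the body, so a clean result here is a hint rather than a guarantee.

```ruby
# Regexp quoted above for __VERIFY_ACCOUNT.
VERIFY_ACCOUNT = /(?:confirm|updated?|verif(?:y|ied)) (?:your|the) (?:(?:account|current|billing|personal|online)? ?(?:records?|information|account|identity|access|data|login)|"?[^\@\s]+\@\S+"? (?:account|mail ?box)|confirm verification|verify k?now|Ihre Angaben .berpr.ft und best.tigt)/i

# Regexp quoted above for __UNSUB_LINK.
UNSUB_LINK = /\b(?:(?:un)?subscri(?:ber?|ptions?)|abuses?|opt(?:ing)?.?out)\b/i

# Hypothetical helper: returns every phrase in the text matched by __VERIFY_ACCOUNT.
def phishy_phrases(text)
  text.scan(VERIFY_ACCOUNT)
end

# Hypothetical wordings, only here to exercise the regexps.
flagged = phishy_phrases("Please confirm your account by clicking the link below.")
clean   = phishy_phrases("Please confirm your email address by clicking the link below.")

puts flagged.inspect # => ["confirm your account"]
puts clean.inspect   # => []
puts UNSUB_LINK.match?("Click here to unsubscribe") # => true
```

A check like this in the mailer specs would catch a future wording change that reintroduces one of the flagged phrases.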
In the end I added a Sender header (only for some emails, and with the same value as From) in order to please the rules, because this one doesn't look too hackish, but I still don't feel great about it.