updown.io – Website monitoring, simple and inexpensive

The funny rules of SpamAssassin in 2023 (deep dive)

This investigation surprised me, so I thought it would be interesting to share my findings. I hope you'll like it.

Some of my clients occasionally reported that the updown confirmation email (used to confirm a new email address, provided by Devise) had been classified as spam. We're talking about this one:

confirmation email screenshot

It doesn't look too spammy, but mail servers running SpamAssassin were indeed sometimes reporting a "Spam-Score" above 5, which is SpamAssassin's default threshold for considering an email as spam. If we have access to the raw email with headers, this is often easy to see (real example provided by one client):

X-Spam-Report: 
    *  0.0 HTML_MESSAGE BODY: Nachricht =?UTF-8?Q?enth=E4lt?= HTML
    *  2.8 HTML_IMAGE_ONLY_28 BODY: HTML: images with 2400-2800 bytes of words
    * -0.0 T_SCC_BODY_TEXT_LINE No description available.
    *  4.0 URI_PHISH Phishing using web form
X-Spam-Score: 6.8
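
When a client sends several of these reports, it helps to turn them into data instead of reading them by eye. Here is a minimal Ruby sketch that extracts the score and rule name from each line of a report like the one above. The helper name `parse_spam_report` and the `RULE_LINE` pattern are my own assumptions based on this single sample; other servers may format the header differently.

```ruby
# Hypothetical helper, not part of SpamAssassin: parses the "* score RULE ..."
# lines of an X-Spam-Report header into [score, rule_name] pairs.
RULE_LINE = /^\s*\*\s+(-?\d+(?:\.\d+)?)\s+([A-Z0-9_]+)/

def parse_spam_report(report)
  report.each_line.filter_map do |line|
    match = RULE_LINE.match(line)
    [match[1].to_f, match[2]] if match
  end
end

report = <<~REPORT
  X-Spam-Report:
      *  0.0 HTML_MESSAGE BODY: Nachricht =?UTF-8?Q?enth=E4lt?= HTML
      *  2.8 HTML_IMAGE_ONLY_28 BODY: HTML: images with 2400-2800 bytes of words
      * -0.0 T_SCC_BODY_TEXT_LINE No description available.
      *  4.0 URI_PHISH Phishing using web form
REPORT

rules = parse_spam_report(report)
total = rules.sum { |score, _rule| score }.round(1)
top = rules.max_by { |score, _rule| score }

puts "total=#{total} top=#{top.last}"
```

Summing the parsed scores gives the same value as the X-Spam-Score header, and the highest-scoring rule is the first one worth investigating.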

So I started investigating why SpamAssassin was applying these rules to this email and oh boy I wasn't ready for what I found 😅

I first tried reproducing the problem locally by installing SpamAssassin and running some checks on the exact same email from that client (example instructions used on Ubuntu 22.04):

> sudo apt install spamassassin

> spamassassin -V
SpamAssassin version 3.4.6
  running on Perl version 5.34.0

> spamassassin -t < confirmation-instructions.eml
# ...
Content analysis details:   (0.6 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-1.0 RCVD_IN_MSPIKE_H5      RBL: Excellent reputation (+5)
                            [104.245.209.212 listed in wl.mailspike.net]
-0.0 SPF_HELO_PASS          SPF: HELO matches SPF record
 0.7 HTML_IMAGE_ONLY_28     BODY: HTML: images with 2400-2800 bytes of
                            words
 0.0 HTML_MESSAGE           BODY: HTML included in message
-0.1 DKIM_VALID             Message has at least one valid DKIM or DK signature
-0.1 DKIM_VALID_AU          Message has a valid DKIM or DK signature from
                            author's domain
 0.1 DKIM_SIGNED            Message has a DKIM or DK signature, not necessarily
                            valid
-0.0 RCVD_IN_MSPIKE_WL      Mailspike good senders
 1.0 URI_PHISH              Phishing using web form

Disappointingly, the result was very different and the score very low. We could still see the same two impacting rules (HTML_IMAGE_ONLY_28 and URI_PHISH), but with lower scores.

I also tried the -Lt options, which mean "local-only test" (no calls to remote servers, online blacklists, etc.). In that case there are fewer tests, as expected, but the scores of the remaining ones increase:

> spamassassin -Lt < confirmation-instructions.eml
# ...
Content analysis details:   (3.8 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 HTML_MESSAGE           BODY: HTML included in message
 2.8 HTML_IMAGE_ONLY_28     BODY: HTML: images with 2400-2800 bytes of
                            words
 1.0 URI_PHISH              Phishing using web form

My guess is that this makes up for the reduced signal: with fewer rules available, the remaining ones are weighted more heavily so that spam can still reach the threshold of 5.

If you have (or can set up) a local DNS resolver, I would recommend enabling network rules for more reliable results. If using spampd, this is configured with LOCALONLY=0 in /etc/default/spampd.

So even though the scores were lower locally, I knew they could be higher depending on the test mode and on the server's configuration, so it was better to see whether I could avoid the email being flagged as HTML_IMAGE_ONLY_28 and URI_PHISH entirely and eliminate the problem.

I first wrote a quick way to test the spam score of these emails in my specs (using spamd, the daemon version of SpamAssassin, and spamc, its command-line client). This lets me iterate and test changes quickly, and it also guards against future regressions, whether they come from changes to my emails or from new versions of SpamAssassin:

require "open3"

def spam_check(email)
  # Using spamc/spamd (daemon) if available, much faster
  cmd = "spamc --full --connect-retries=1"
  # Using spamassassin (standalone cmd), slower but supports local-only option
  # cmd = "spamassassin -Lt"
  # Inject Received header to trigger more rules like __VIA_ML (return-path contains "bounces@")
  stdin = "Received: by mta212a-ord.mtasv.net id h6qj0s27tk4a for <#{email.to.first}>; #{email.date.rfc2822} (envelope-from <pm_bounces@bounce.updown.io>)\n" + email.to_s
  out, err, status = Open3.capture3(cmd, stdin_data: stdin)
  if out == "0/0\n"
    skip "spamd is not running: `sudo systemctl start spamassassin.service`"
  elsif status.success?
    # minor processing to have stable rules orders and remove width limit
    headers, rules = out.chomp.split("--\n")
    rules.gsub!(/\n\s{5,}/m, " ")
    return headers + "\n" + rules.lines.sort.join
  else
    raise "Command `#{cmd}` exited with status #{status.exitstatus}: #{err}"
  end
rescue Errno::ENOENT => e
  skip "SpamAssassin not installed: #{e.to_s}"
end

require "rails_helper"

describe UserMailer do # Devise inherited mailer
  let(:user) { create :user }
  let(:email) { ActionMailer::Base.deliveries.last }

  describe '#confirmation_instructions' do
    subject { user }

    it "passes spam check" do
      subject
      expect(spam_check(email)).to include(<<~REPORT)
        Content analysis details:   (0.0 points, 5.0 required)

         pts rule name              description
        ---- ---------------------- ------------------------------------------------
         0.0 HTML_IMAGE_ONLY_32     BODY: HTML: images with 2800-3200 bytes of words
         0.0 HTML_MESSAGE           BODY: HTML included in message
        -0.0 NO_RELAYS              Informational: message was not relayed via SMTP
      REPORT
    end
  end
end

Now let's have a look at these two rules. It's hard to find clear definitions sometimes but fortunately SpamAssassin is open source so where there is a will there's a way.

HTML_IMAGE_ONLY_28

This one is the easiest and the most self-explanatory: it simply checks whether the email contains an image (it does, the updown.io logo) and whether the content is between 2400 and 2800 bytes. So basically, if the email is short and has an image, it's more likely to be spam (this is because some spam emails hide their text in images to avoid filters). There are only two options here:
1. Remove the image
2. Increase the content length

I chose the latter to keep a consistent look and also because of the second rule. In the end I only increased it a bit and now it matches the HTML_IMAGE_ONLY_32 rule. This rule scores 2.2 in local-only testing but, surprisingly, 0 when network tests are enabled. (If it followed the same logic as HTML_IMAGE_ONLY_28, which drops from 2.8 to 0.7, it should have been 2.2/4 ≃ 0.55.)

Getting rid of this rule would require much more text bloat or cheating (invisible text, etc.), and it matches more of my emails, so for the moment I decided to leave it like that and wait for the next problem. 2.2 is not enough on its own to trip the spam threshold (5), and hopefully SpamAssassin will improve this part before I need to hack around it.
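
To get a feel for which bucket a template would land in before running the real thing, here is a rough Ruby approximation of the logic described above. Everything in it is an assumption for illustration: the helper name, the naive tag-stripping regex, and the idea that the 400-byte steps seen in the rule names run from 0 up to 3200. SpamAssassin computes the text length with its own HTML parser, so the exact boundaries may differ.

```ruby
# Rough approximation only: SpamAssassin measures the text with its own
# HTML parser, this sketch just strips tags with a naive regex.
BUCKET_WIDTH = 400   # assumed from the 2400-2800 / 2800-3200 rule descriptions
MAX_BYTES = 3200     # assumed upper bound (HTML_IMAGE_ONLY_32)

def image_only_bucket(html)
  return nil unless html.match?(/<img\b/i)

  text_bytes = html.gsub(/<[^>]*>/, "").bytesize
  return nil if text_bytes.zero? || text_bytes > MAX_BYTES

  # Round up to the next multiple of BUCKET_WIDTH to get the bucket's upper bound.
  upper = ((text_bytes + BUCKET_WIDTH - 1) / BUCKET_WIDTH) * BUCKET_WIDTH
  format("HTML_IMAGE_ONLY_%02d", upper / 100)
end

short_email = "<p><img src=\"logo.png\"></p><p>#{'a' * 2500}</p>"
longer_email = "<p><img src=\"logo.png\"></p><p>#{'a' * 3000}</p>"

puts image_only_bucket(short_email).inspect
puts image_only_bucket(longer_email).inspect
```

With this approximation, adding about 500 bytes of text moves the sample from the 28 bucket to the 32 bucket, which mirrors what happened to my email.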

URI_PHISH

Now for the most interesting part. After some online searching I first found this, which seems to be a plugin checking URLs against a blacklist, but it produces the URI_PHISHING rule (not exactly the same name) and I didn't install any plugin, so this is not the one.

I then found this very interesting report in 2021 about a similar confirmation email receiving a "false positive" classification as URI_PHISH, and the official answer was:

It's not based on "phishing URLs" or the specific link, it's based on having body text that looks like account phishing and having a URL. The body text that looks suspiciously like phishing is, unsurprisingly, "confirm your account".

As Loren said, this is not a FP, as the total score for the message did not exceed the spam threshold. This is a single-rule hit on spammy-looking content without other signs to support it. That happens.

It is not a bug that a given rule will hit some ham. The only suggestion I can offer is that you reword your message to make it look less like phishing.

So let's skip over the fact that it is very sad that anti-spam filters now have to block any simple confirmation email just because scammers are successfully abusing people with them...

That piqued my curiosity: what exactly are they looking for in the email? How can I make sure that the change I make won't be matched by another rule, now or in the future? (Yes, we unfortunately have to think like scammers now in order to get our regular emails accepted...)

So by searching for URI_PHISH in the code I ended up in this big rules file, which contains this (extract slightly simplified):

meta        __URI_PHISH    __HAS_ANY_URI && !__URI_GOOGLE_DOC && !__URI_GOOG_STO_HTML && (__EMAIL_PHISH || __ACCT_PHISH)
meta        URI_PHISH      __URI_PHISH && !ALL_TRUSTED && !__UNSUB_LINK && !__TAG_EXISTS_CENTER && !__HAS_SENDER && !__CAN_HELP && !__VIA_ML && !__UPPERCASE_URI && !__HAS_CC && !__NUMBERS_IN_SUBJ && !__PCT_FOR_YOU && !__MOZILLA_MSGID && !__FB_COST && !__hk_bigmoney && !__REMOTE_IMAGE && !__HELO_HIGHPROFILE && !__RCD_RDNS_SMTP_MESSY && !__BUGGED_IMG && !__FB_TOUR && !__RCVD_DOTGOV_EXT 
describe    URI_PHISH            Phishing using web form
score       URI_PHISH            4.00   # limit

OK, so we now have an entry point which references MANY other rules of course (some of which also reference other rules). I checked ALL of them for you ^^ and here are my most interesting findings:

First, the positive rules, which need to be true:

__HAS_ANY_URI → simple regexp on /^\w+:\/\//
__EMAIL_PHISH || __ACCT_PHISH → these are the sub-rules where the main "phishing" heuristics happen:
- __WEBMAIL_ACCT, __MAILBOX_FULL, __MAILBOX_FULL_SE, __CLEAN_MAILBOX, __VALIDATE_MAILBOX, __VALIDATE_MBOX_SE, __UPGR_MAILBOX, __LOCK_MAILBOX, __SYSADMIN, __ATTN_MAIL_USER, __MAIL_ACCT_ACCESS1, __MAIL_ACCT_ACCESS2, __ACCESS_REVOKE, __PASSWORD_UPGRADE, __PENDING_MESSAGES, __RELEASE_MESSAGES, __PASSWORD_EXP_CLUMSY → these are all regexps for typical email scams (mailbox full, click here to regain access to your account, etc...), nothing matching in my email.
- __PDS_FROM_NAME_TO_DOMAIN ⚠️ this one is interesting: it triggers if the From name is equal to the To domain (for example if the email is From "example.com" To "adrien@example.com"). → this is because many scams use that to make it look like the email comes from your "domain administrator". It wasn't the case for me here, but make sure you don't do that.
- __VERIFY_ACCOUNT → ✅ this is the one matching our email so I had to change the wording to avoid it. The regexp is: /(?:confirm|updated?|verif(?:y|ied)) (?:your|the) (?:(?:account|current|billing|personal|online)? ?(?:records?|information|account|identity|access|data|login)|"?[^\@\s]+\@\S+"? (?:account|mail ?box)|confirm verification|verify k?now|Ihre Angaben .berpr.ft und best.tigt)/i
- __FAILED_LOGINS, __ACCOUNT_REACTIV, __SECURITY_DEPT, __ACCOUNT_ERROR, __ACCOUNT_DISRUPT, __ACCOUNT_UPGRADE, __ACCOUNT_SECURE, __SUSPICION_LOGIN, __ACCESS_SUSPENDED, __ACCESS_RESTORE, __ACCESS_REVOKE → another set of regexps for classic account scams based on fear. I made sure my "account locked" email does not match any of those.
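
Since the __VERIFY_ACCOUNT regexp is plain enough to run outside SpamAssassin, candidate wordings can be checked against it directly. The Ruby sketch below copies the regexp exactly as quoted in the list above (so it carries the same simplifications); the helper names and the sample sentences are hypothetical. The second helper mirrors the __PDS_FROM_NAME_TO_DOMAIN description only, not the real rule's implementation.

```ruby
# Regexp copied from the (slightly simplified) __VERIFY_ACCOUNT extract above.
VERIFY_ACCOUNT = /(?:confirm|updated?|verif(?:y|ied)) (?:your|the) (?:(?:account|current|billing|personal|online)? ?(?:records?|information|account|identity|access|data|login)|"?[^\@\s]+\@\S+"? (?:account|mail ?box)|confirm verification|verify k?now|Ihre Angaben .berpr.ft und best.tigt)/i

# Hypothetical helper: true when the wording would match __VERIFY_ACCOUNT.
def phishy_wording?(body)
  VERIFY_ACCOUNT.match?(body)
end

# Hypothetical helper based only on the description of __PDS_FROM_NAME_TO_DOMAIN:
# true when the From display name equals the recipient's domain.
def from_name_is_to_domain?(from_name, to_address)
  from_name.strip.casecmp?(to_address.split("@").last)
end

[
  "You can confirm your account email through the link below",
  "Please validate your email address using the link below"
].each do |wording|
  puts "#{phishy_wording?(wording) ? 'MATCH' : 'ok   '}  #{wording}"
end
```

Dropping a check like this into the mailer specs catches risky wording much faster than a full spamc round trip, although it only covers this one sub-rule.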

Now let's look at all the negative rules here (prefixed with a `!`), which are meant to exclude content (if one of these rules is true, then the `URI_PHISH` rule will NOT apply):

!__URI_GOOGLE_DOC and !__URI_GOOG_STO_HTML → regexp on docs\.google\.com and storage\.googleapis\.com, they got their own special rule so are excluded here.
!ALL_TRUSTED → this is for when you configure some internal email servers as "trusted", not applicable here
!__UNSUB_LINK → ⚠️ Also interesting, this one tries to match unsubscribe links with /\b(?:(?:un)?subscri(?:ber?|ptions?)|abuses?|opt(?:ing)?.?out)\b/i. It's good to know that simply having an unsubscribe link could prevent URI_PHISH, but unfortunately for an account confirmation email you can't really "unsubscribe" people: this is not a mailing list or an on-boarding email. Otherwise this would have been a good option to improve both the spam score and the user experience.
!__VIA_ML → this rule checks if the envelope-from/return-path contains "bounces@" to declare this is a "Mailing List". With Postmark this is the case for me, and unfortunately it cannot be customized (only the domain can: pm_bounces@bounce.updown.io). So I guess you should avoid using "bounces@" in your return path addresses for transactional emails if you can...

And now let's have a look at my favorites: the totally WTF rules 😱:

!__TAG_EXISTS_CENTER → this rule just checks for the presence of a <center> tag. So if you add one, magically your email is no longer URI_PHISH... WAIT, WHAT? Surely if your email is centered the old way, then it's not phishing (tested locally).
!__HAS_SENDER → if you add ANY Sender header, the URI_PHISH rule is skipped… The goal of the Sender header is for services sending emails on behalf of other users, it helps for authentication validations. But anybody can put anything in here, so there's no reason to consider an email "less phishing" because it contains this header. (tested locally)
!__CAN_HELP → even simpler, this will skip the rule if the email contains "can help"... (tested locally)
!__UPPERCASE_URI → pretty self-explanatory
!__HAS_CC → what? why?
!__NUMBERS_IN_SUBJ → OK so 3 or more consecutive digits in the subject line also help... /\d{3}/
!__FB_COST → this one checks for the word... "cost". Yep, just that. Put it in an email and suddenly it's not phishing... (tested locally)
!__FB_TOUR → similarly this one checks for the word "tour"...
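
To see at a glance which of these exclusions a given email would trip, the descriptions above can be turned into a small Ruby checker. Only the __UNSUB_LINK and __NUMBERS_IN_SUBJ patterns are the ones quoted in this post; every other pattern, the helper name, and the choice to test the unsubscribe regexp against the body are my own guesses from the descriptions, not the real SpamAssassin definitions.

```ruby
# Patterns quoted in the post:
UNSUB_LINK = /\b(?:(?:un)?subscri(?:ber?|ptions?)|abuses?|opt(?:ing)?.?out)\b/i
NUMBERS_IN_SUBJ = /\d{3}/

# Hypothetical helper: returns the names of the exclusion sub-rules that an
# email would probably trigger. All patterns below are approximations.
def exclusion_hits(subject:, body:, headers: {})
  hits = []
  hits << "__UNSUB_LINK" if body.match?(UNSUB_LINK)
  hits << "__NUMBERS_IN_SUBJ" if subject.match?(NUMBERS_IN_SUBJ)
  hits << "__TAG_EXISTS_CENTER" if body.match?(/<center\b/i)  # assumed pattern
  hits << "__CAN_HELP" if body.match?(/can help/i)            # assumed pattern
  hits << "__FB_COST" if body.match?(/\bcost\b/i)             # assumed pattern
  hits << "__FB_TOUR" if body.match?(/\btour\b/i)             # assumed pattern
  hits << "__HAS_SENDER" if headers.key?("Sender")
  hits << "__HAS_CC" if headers.key?("Cc")
  hits << "__VIA_ML" if headers.fetch("Return-Path", "").include?("bounces@")
  hits
end

puts exclusion_hits(
  subject: "Confirm your email",
  body: "<center>Welcome!</center> This will cost you nothing.",
  headers: { "Return-Path" => "pm_bounces@bounce.updown.io" }
).inspect
```

An empty result means none of these escape hatches apply, so the email depends entirely on its wording to stay clear of URI_PHISH.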

It's likely that some of these rules are only here so that URI_PHISH can be replaced by another, more specific rule (as we saw with the Google Doc URLs), but still, in this state it's pretty easy to exploit them, and in my local testing, using those words to trigger those rules didn't cause other spam rules to appear...

Which means that in the end we have a spam filter which is very easy to fool, yet easily tripped by honest emails...
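
The boolean logic of the two meta rules quoted earlier is easy to replay in Ruby, which makes the "easy to fool" part very concrete. This is only a sketch of the expression as it appears in the simplified extract: the method names are mine, and the set of triggered sub-rules is supplied by hand instead of coming from SpamAssassin.

```ruby
require "set"

# Sub-rules negated in the URI_PHISH meta rule (from the simplified extract).
EXCLUSIONS = %w[
  ALL_TRUSTED __UNSUB_LINK __TAG_EXISTS_CENTER __HAS_SENDER __CAN_HELP __VIA_ML
  __UPPERCASE_URI __HAS_CC __NUMBERS_IN_SUBJ __PCT_FOR_YOU __MOZILLA_MSGID
  __FB_COST __hk_bigmoney __REMOTE_IMAGE __HELO_HIGHPROFILE __RCD_RDNS_SMTP_MESSY
  __BUGGED_IMG __FB_TOUR __RCVD_DOTGOV_EXT
].freeze

# Mirrors: meta __URI_PHISH
def uri_phish_base?(hits)
  hits.include?("__HAS_ANY_URI") &&
    !hits.include?("__URI_GOOGLE_DOC") &&
    !hits.include?("__URI_GOOG_STO_HTML") &&
    (hits.include?("__EMAIL_PHISH") || hits.include?("__ACCT_PHISH"))
end

# Mirrors: meta URI_PHISH (the base rule minus every exclusion)
def uri_phish?(hits)
  uri_phish_base?(hits) && EXCLUSIONS.none? { |rule| hits.include?(rule) }
end

confirmation = Set["__HAS_ANY_URI", "__ACCT_PHISH"]

puts uri_phish?(confirmation)                     # plain confirmation email
puts uri_phish?(confirmation | ["__HAS_SENDER"])  # same email plus a Sender header
```

Any single exclusion is enough to switch the rule off, whatever the rest of the email says.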

What I changed in the end

- I tried changing the return-path to avoid "bounces@" but couldn't do it with Postmark unfortunately.
- I did not want to ~use~ exploit any of the stupid hacks like "cost" or "<center>".
- I changed the wording of the email to make it longer and avoid the common word combinations matched in the regexp (see screenshot below for the new version).
- I also added a Sender header (only for some emails and with the same value as From) in order to please the rules because this one doesn't look too hackish, but I still don't feel great about this 🙃.

New email

new email screenshot

Adrien Rey-Jarthon

Created on December 04, 2023