Analysis and blocking of structured spam headers
Spam, like the tide, ebbs and flows. Whatever spam filtering tool one is using, there will come a time when new spam messages start sneaking past it. Usually these are temporary setbacks, as the filtering tools are taught to recover from their errors. Sometimes, though, the spam is irritating enough to rate a closer look.
In this post, I'll describe my investigation of a recent surge in spam messages, the common header traits that mark this particular family of spam, and the one-off filter I wrote to flag these messages until my main filters catch up.
The spam messages are obvious to a human:

But my spam filter (bogofilter) was clearly having trouble detecting them. I decided to examine the headers to see if any traits marked these messages out:
From CheapFlightDeals@yuoi2as.johutch.top Tue Aug 2 14:42:16 2016
Return-Path:
X-Original-To: gowen@swynwyr.com
Delivered-To: gowen@swynwyr.com
Received: from yuoi2as.johutch.top (unknown [205.209.137.204])
    by bifrost.swynwyr.com (Postfix) with ESMTP id D5D621F561
    for ; Tue, 2 Aug 2016 14:42:16 +0000 (UTC)
Date: Tue, 02 Aug 2016 07:54:03 -0700
Subject: Summer Specials on Cheap Flights - Don't Delay
Content-Type: multipart/alternative; boundary="8296745_1701287_8296745"
Fearful: 8296745-31be7921b5256c0654c1658b43ac9faa.1701287
To:
From: Cheap Flight Deals
Mime-Version: 1.0
Message-ID: <31be7921b5256c0654c1658b43ac9faa.Stoccado.Especially.gowen@swynwyr.com>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Now, just looking at those headers, one of them stands out. Even if you don't have much experience reading email headers, one of the header names (the bit on the left, before the colon) should catch your eye... which one do you think?
.
.
.
That's right! I’ve never heard of the “Fearful:” header either. Perusing several other emails, I began to notice a number of headers whose name was a normal English word and whose value was somewhat predictably formatted. Here are examples from five separate messages:
Fearful: 8296745-31be7921b5256c0654c1658b43ac9faa.1701287
Orient: 18474119-31be7921b5256c0654c1658b43ac9faa_22508371
Chondrite: 6405762.31be7921b5256c0654c1658b43ac9faa.21720535
Edelweiss: 21641759.31be7921b5256c0654c1658b43ac9faa.6837179
Femininity: 18284332-31be7921b5256c0654c1658b43ac9faa-17293339
This header clearly has the format “number1”, 32 characters of hexadecimal, “number2”, where the tokens are separated by “-”, “.”, or “_”. A little more perusal shows that the “number1” and “number2” tokens are also used as part of the MIME boundary. Notice how '8296745' and '1701287' show up on both lines here:
Content-Type: multipart/alternative; boundary="8296745_1701287_8296745"
Fearful: 8296745-31be7921b5256c0654c1658b43ac9faa.1701287
So there’s a useful test right there. “If the MIME boundary= has the form (\d+)_(\d+)_\1, and you find another header whose value has the form (\d+)[-._]([0-9a-f]{32})[-._](\d+), where the two numbers match the two numbers in the boundary, then you have spam!” Easy, right?
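The test just described can be sketched in a few lines of Python. This is only an illustration, not the filter I actually run: the function name has_linked_header and the list-of-pairs input format are my own assumptions. Note that the hex class is written [0-9a-f] and that the "-" comes first in the separator class so it is taken literally.

```python
import re

# Boundary of the form number1_number2_number1, as in the sample above.
BOUNDARY_RE = re.compile(r'boundary="?(\d+)_(\d+)_\1(?!\d)')
# number1 SEP 32-hex-digits SEP number2, where SEP is "-", "." or "_".
TOKEN_RE = re.compile(r'^(\d+)[-._]([0-9a-f]{32})[-._](\d+)$')


def has_linked_header(headers):
    """Return True if some header value reuses both boundary numbers.

    `headers` is a list of (name, value) pairs (a hypothetical input
    format chosen for this sketch).
    """
    boundary = None
    for name, value in headers:
        if name.lower() == "content-type":
            boundary = BOUNDARY_RE.search(value)
            break
    if boundary is None:
        return False
    n1, n2 = boundary.group(1), boundary.group(2)
    for name, value in headers:
        m = TOKEN_RE.match(value.strip())
        if m and m.group(1) == n1 and m.group(3) == n2:
            return True
    return False
```

The header name is deliberately ignored, since the "weird word" changes from message to message; only the shape of the value and its link to the boundary are checked.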
Which is fine, but before I go do that, it strikes me as odd that the same hexadecimal value shows up in all these separate messages. What is 31be7921b5256c0654c1658b43ac9faa and why do the spammers keep using it? Well, at 32 hex digits it has the right shape for an MD5 hash. What data could they be hashing that would be the same across all the messages they’re sending me? Well… now that you mention it…
X-Original-To: gowen@swynwyr.com
Oh, right. My email address.
$ echo -n gowen@swynwyr.com | md5sum
31be7921b5256c0654c1658b43ac9faa -
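For use inside a filter, the same computation is available from Python's standard hashlib module. A minimal sketch, where address_hash is a name I made up for illustration; the address is hashed exactly as given, with no trailing newline, which is what the -n flag to echo achieves above.

```python
import hashlib


def address_hash(address):
    """Return the MD5 hex digest of an address, hashed exactly as given."""
    return hashlib.md5(address.encode("utf-8")).hexdigest()


# Same input as the shell pipeline above.
print(address_hash("gowen@swynwyr.com"))
```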
Grepping through the spam that I had on hand, I find that it’s included in the Message-ID header as well:
$ grep -h Message-ID *
Message-ID: <31be7921b5256c0654c1658b43ac9faa.21287638.12663829gowen@swynwyr.com>
Message-ID: <31be7921b5256c0654c1658b43ac9faa.Fearlessness.Gymnast.gowen@swynwyr.com>
Message-ID: <31be7921b5256c0654c1658b43ac9faa.Softly.Barracoon.gowen@swynwyr.com>
Message-ID: <31be7921b5256c0654c1658b43ac9faa.4282924.13413234gowen@swynwyr.com>
Message-ID: <0.0.31be7921b5256c0654c1658b43ac9faa.21641759.6837179.0gowen@swynwyr.com>
Message-ID: <31be7921b5256c0654c1658b43ac9faa.Stoccado.Especially.gowen@swynwyr.com>
Message-ID: <31be7921b5256c0654c1658b43ac9faa.22266702.21517097gowen@swynwyr.com>
Message-ID: <31be7921b5256c0654c1658b43ac9faa.21054252.2424093gowen@swynwyr.com>
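Examining those, I find two main variants (plus one outlier that wraps the numeric form in extra “0” fields):
“Message-ID has hash of recipient address and two words, all separated by ‘.’”
And
“Message-ID has hash of recipient address and two number fields, separated by ‘.’, with the second number running straight into the recipient address”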
Oddly enough, in the messages that have no “weird word” header, those “two number fields” are the same two numbers that appear in the MIME boundary. And the messages whose Message-ID contains “weird words” reliably carry a “weird word” header which contains the hash and the two numbers from the boundary. (I do not see any connection between the “weird words” in the Message-ID header and the name of the “weird word” header.)
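The two Message-ID shapes can be told apart with a pair of regular expressions. The sketch below is an assumption-laden illustration: classify_message_id and its return tuples are hypothetical, and the patterns are fitted only to the handful of samples in the grep output above. The ID that begins with “0.0.” fits neither shape and is deliberately left unclassified.

```python
import re

HASH = r"[0-9a-f]{32}"
# <hash.Word.Word.local@domain>
WORD_ID_RE = re.compile(
    r"^<(" + HASH + r")\.([A-Za-z]+)\.([A-Za-z]+)\.[^@\s]+@[^>\s]+>$")
# <hash.number.numberlocal@domain> (no "." before the local part)
NUM_ID_RE = re.compile(
    r"^<(" + HASH + r")\.(\d+)\.(\d+)[^@\s]+@[^>\s]+>$")


def classify_message_id(message_id):
    """Return ("words"|"numbers", hash, field1, field2), or None."""
    message_id = message_id.strip()
    m = WORD_ID_RE.match(message_id)
    if m:
        return ("words", m.group(1), m.group(2), m.group(3))
    m = NUM_ID_RE.match(message_id)
    if m:
        return ("numbers", m.group(1), m.group(2), m.group(3))
    return None
```

One known weakness: if the recipient's local part began with a digit, the second number in the numeric form could not be split off reliably, since nothing separates the two.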
Writing an actual mail filter that looks for these interconnected headers is fairly straightforward. The version I’m using now is lazy, and simply looks for the presence of the hashed delivery address (taken from X-Original-To) in the Message-ID header. But it would be possible to search more rigorously for the two variants we’ve seen:
X-Original-To: gowen@swynwyr.com
Orient: 18474119-31be7921b5256c0654c1658b43ac9faa_22508371
Message-ID: <31be7921b5256c0654c1658b43ac9faa.Fearlessness.Gymnast.gowen@swynwyr.com>
Content-Type: multipart/alternative; boundary="18474119_22508371_18474119"
And
X-Original-To: gowen@swynwyr.com
Message-ID: <31be7921b5256c0654c1658b43ac9faa.4282924.13413234gowen@swynwyr.com>
Content-Type: multipart/alternative; boundary="4282924_13413234_4282924"
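Both approaches can be sketched in Python using the standard email package. To be clear, this is not the filter I run, just one way it could look under my own assumptions: the function names hash_in_message_id and matches_known_variant are invented for illustration, the input is assumed to be the raw message text, and X-Original-To is assumed to hold the delivery address (as Postfix sets it on my server).

```python
import hashlib
import re
from email.parser import HeaderParser


def _digest(address):
    # MD5 of the address exactly as it appears in X-Original-To.
    return hashlib.md5(address.encode("utf-8")).hexdigest()


def hash_in_message_id(raw_message):
    """Lazy check: is the hashed delivery address in the Message-ID?"""
    msg = HeaderParser().parsestr(raw_message)
    recipient = (msg.get("X-Original-To") or "").strip()
    if not recipient:
        return False
    return _digest(recipient) in (msg.get("Message-ID") or "")


def matches_known_variant(raw_message):
    """Stricter check: require one of the two header layouts shown above."""
    msg = HeaderParser().parsestr(raw_message)
    recipient = (msg.get("X-Original-To") or "").strip()
    # The boundary must look like number1_number2_number1.
    boundary = re.fullmatch(r"(\d+)_(\d+)_\1", msg.get_boundary() or "")
    if not recipient or boundary is None:
        return False
    digest = _digest(recipient)
    n1, n2 = boundary.group(1), boundary.group(2)
    message_id = (msg.get("Message-ID") or "").strip()
    # Numeric variant: hash and both boundary numbers in the Message-ID.
    if re.match(r"<%s\.%s\.%s(?!\d)" % (digest, n1, n2), message_id):
        return True
    # Word variant: two words in the Message-ID, plus some header whose
    # value is number1, the hash, and number2 joined by "-", "." or "_".
    if re.match(r"<%s\.[A-Za-z]+\.[A-Za-z]+\." % digest, message_id):
        token = re.compile(r"%s[-._]%s[-._]%s" % (n1, digest, n2))
        return any(token.fullmatch(value.strip()) for _, value in msg.items())
    return False
```

The lazy check is the broader of the two: it flags any message whose Message-ID embeds the hash, so it would keep working if the spammers changed the boundary format, while the strict check would be safer against false positives.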