From: Rusi <rustompmody@gmail.com>
To: help-gnu-emacs@gnu.org
Subject: Re: Spam through the newsgroup gateway
Date: Thu, 15 Nov 2018 19:38:19 -0800 (PST)
Message-ID: <d5fd832c-630f-4693-af70-29dbd9f0f29b@googlegroups.com>
In-Reply-To: <mailman.3846.1541888586.1284.help-gnu-emacs@gnu.org>

On Sunday, November 11, 2018 at 3:53:08 AM UTC+5:30, Bob Proulx wrote:
> Alexandre Garreau wrote:
> > Bob Proulx wrote:
> > > Public Service Announcement: Please do not reply to spam. If a valid
> > > message is in reply to a spam message then it refers to it and in a
> > > sense validates it. To talk about spam please use an independent
> > > thread so as not to validate the original spam.
> >
> > Why so?
>
> The best anti-spam engines in practice are learning engines such as
> Bayesian filters and others. Spam characteristics change so quickly and their
> human senders keep trying to be more sneaky than before. We use no
> fewer than three! SpamAssassin, Bogofilter, and CRM114. By far
> CRM114 is the best of those three. But there are subtle differences
> that keep me playing one off the other and therefore continuing to add
> engines rather than remove them.
>
> Since they are learning engines they must be trained in order to
> learn. The best training has been training on error. When an engine's
> classification differs from the known-correct one, it must be corrected.
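
Training on error can be sketched as follows. This is a toy under stated assumptions, not the configuration described above: `TinyBayes` and `train_on_error` are hypothetical names, and the classifier is a bare-bones naive Bayes over whitespace-separated tokens, far simpler than SpamAssassin, Bogofilter, or CRM114.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy two-class token classifier with add-one smoothing."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.tokens[label].update(text.lower().split())
        self.messages[label] += 1

    def classify(self, text):
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) + 1
        seen = sum(self.messages.values())
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.tokens[label].values())
            # Smoothed class prior plus smoothed per-token log likelihoods.
            score = math.log((self.messages[label] + 1) / (seen + 2))
            for tok in text.lower().split():
                score += math.log((self.tokens[label][tok] + 1) / (total + vocab))
            scores[label] = score
        return max(scores, key=scores.get)


def train_on_error(model, text, true_label):
    """Train only when the model's guess disagrees with the known label."""
    if model.classify(text) != true_label:
        model.train(text, true_label)
        return True
    return False
```

Messages the model already classifies correctly are skipped, so the token database only grows where the model was wrong.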
>
> All messages are fed through the anti-spam classification engines
> twice. Once on the frontend in order to classify the message to
> determine if it should be automatically discarded. And then once
> again after the messages go through the mailing list to train on any
> errors. Since the mailing lists are relatively spam free (IMNHO),
> I assume that any message through the mailing list is a desired
> message. If any of the learning engines think otherwise then it
> triggers training to learn that message as non-spam.
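
The two passes might look roughly like the sketch below. Everything here is illustrative: `KeywordEngine` is a hypothetical stand-in for a real learning engine, and the "discard if any engine flags it" rule in the first pass is an assumption, since the message does not say how the three verdicts are combined.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordEngine:
    """Stand-in for a learning engine; real engines are far more capable."""
    name: str
    spam_words: set = field(default_factory=set)

    def is_spam(self, text):
        return any(word in self.spam_words for word in text.lower().split())

    def learn_ham(self, text):
        # Crude "training": stop treating this message's words as spam signals.
        self.spam_words -= set(text.lower().split())


def frontend_pass(engines, text):
    """Pass 1: classify before the list sees the message."""
    # Assumption: one spam verdict is enough to discard.
    return "discard" if any(e.is_spam(text) for e in engines) else "deliver"


def post_list_pass(engines, text):
    """Pass 2: the message came through the list, so treat it as wanted
    and retrain every engine that still calls it spam."""
    retrained = []
    for engine in engines:
        if engine.is_spam(text):
            engine.learn_ham(text)
            retrained.append(engine.name)
    return retrained
```

Only the engines that disagree with the "desired message" assumption are retrained; the others are left alone.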
>
> SpamAssassin knows the structure of email, what's a header and what is
> the body. Bogofilter and CRM114 have no knowledge of email structure
> and process the message as a raw file looking at tokens in the headers
> and structure and learning them as either indicators or not
> dynamically. For them this includes IP addresses and email addresses
> and everything. Everything is open to being latched onto.
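
To make the contrast concrete, here is a small sketch of the two views of one message. It is not how Bogofilter or CRM114 actually tokenize; `raw_tokens`, `structured_view`, and the sample message are made up for illustration, and only Python's standard `email` parser is assumed.

```python
import re
from email import message_from_string

RAW = """\
From: seller@example.com
Received: from 203.0.113.7 by mx.example.org
Subject: Great deal

Buy now
"""


def raw_tokens(raw):
    """Structure-blind view: every run of non-whitespace is a token."""
    return re.findall(r"\S+", raw)


def structured_view(raw):
    """Structure-aware view: headers and body are separated first."""
    msg = message_from_string(raw)
    return dict(msg.items()), msg.get_payload()
```

In the raw view the IP address and the sender address are ordinary tokens sitting next to "Buy" and "now", so a learner can attach weight to any of them.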
>
> Just recently, due to our conversations about the newsgroup gateway
> here, I have modified this algorithm slightly. I now look for the
> newsgroup gateway header. If a message entered through the newsgroup
> then I ignore it. There isn't anything I can do about it. Training
> on it makes no sense. Therefore I ignore it. No training. But until
> recently I did train on newsgroup messages too.
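
A check like that could be as small as the sketch below. The header name is hypothetical; the message does not say which header marks mail that arrived through the newsgroup gateway, so substitute the real one.

```python
from email import message_from_string

# Hypothetical header name, used only for illustration.
GATEWAY_HEADER = "X-Example-News-Gateway"


def should_train(raw_message):
    """Return False for messages that entered through the gateway."""
    msg = message_from_string(raw_message)
    return msg.get(GATEWAY_HEADER) is None
```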
>
> If someone replies to the message then the email headers, the
> structure of it and, goodness forbid, any of the message they quote
> (top posting on the entire spam is worst) may all have been
> associated with spam, but now that it comes through the mailing list
> it will be associated with non-spam. Training the learning
> engines on it will pull the database toward thinking that that type of
> message, spam though it is, is desirable on the mailing list, and they
> will pass it through in the future. It will eventually correct but may
> take a while, around a month for the size of the token
> database we keep. From week to week the trend in spam changes.
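
The effect of a quoting reply on a token database can be shown with plain counts. This is a deliberately simplified sketch: `learn` and `spamicity` are hypothetical helpers, and real engines use smoothed statistics rather than a raw ratio.

```python
from collections import Counter

spam_counts, ham_counts = Counter(), Counter()


def learn(text, counts):
    counts.update(text.lower().split())


def spamicity(token):
    """Fraction of this token's sightings that were in spam (0.5 if unseen)."""
    s, h = spam_counts[token], ham_counts[token]
    return s / (s + h) if s + h else 0.5


learn("miracle pills discount", spam_counts)
before = spamicity("pills")   # 1.0: seen only in spam

# A well-meaning reply quotes the spam and passes through the list,
# so the second pass learns the whole thing as wanted mail.
learn("> miracle pills discount\nplease ban this sender", ham_counts)
after = spamicity("pills")    # 0.5: the token is now ambiguous
```

One reply halves the evidence against "pills" in this toy; in a real database the shift per message is smaller, which matches the "incremental" description.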
>
> > If nothing is sent to whoever sent the mail, will they
> > track the mailing list or its archive to find some other mail referring
> > to it, take this as encouragement, and post more spam?
>
> Not likely. I think for spammers it is mostly send and forget (like a
> "fire and forget" military missile).
>
> > Otherwise, what's the problem with validation if it's for a single spam?
> > Let's say someone's antispam blocks that spam: it seems normal to me,
> > whenever a discussion is about some spam that has been
> > relayed by the list, that the user either sees the aforementioned spam,
> > to acknowledge the problem others are experiencing (and get a sample of
> > it), or does not see the thread at all, as they're not concerned.
>
> If it is a single spam it isn't the end of the world. It is all just
> incremental, because it will be used to train the learning engines.
> And they will recover given enough time and good later input. But
> every little bit counts!
>
> > Ideally there should be a way to trigger metadata so that when you
> > reply to something you can also mark it as spam for people seeing
> > your message, for example with a mail header.
>
> There are systems in use where the community can vote upon messages.
> They usually require multiple votes, say five, from known quality
> voters, and then the message is hidden. But mostly we see those with
> web page forums. Since this is a mailing list, to install
> such a thing we would need to have users trained on how to do this.
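
The voting rule itself is simple to state in code. A minimal sketch, assuming votes arrive as a list of voter addresses and that "known quality voters" is just a set; `is_hidden` and its parameters are hypothetical names, and the default of five comes from the "say five" above.

```python
def is_hidden(votes, trusted_voters, threshold=5):
    """Hide a message once enough distinct trusted voters have flagged it."""
    distinct_trusted = set(votes) & set(trusted_voters)
    return len(distinct_trusted) >= threshold
```

Counting distinct voters means one person voting repeatedly cannot hide a message alone, and votes from unknown addresses are ignored.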
>
> As another data point in this area the Debian mailing lists have an
> address to which people can "bounce" the spam for further training of
> their anti-spam learning engines. It also serves as a notification to
> the listmaster, when a new type slips through, that spam is flowing in
> and needs help to be blocked (they use procmail rules, we do too).
> (Mutt has a 'b'ounce mail action, other mailers may or may
> not.) We could set up something like that but one does not exist at
> the moment. With some more work it could be useful if people were to
> contribute spams that slip through into the mailing list to it.
>
> Sorry for the long delay in answering this message. Life and time is
> what keeps everything from happening all at once.
>
> Bob

You seem to be doing a splendid job of managing ML-news gateway spam.
[ Compare https://groups.google.com/forum/#!forum/comp.lang.python ]
I wonder how easy it would be for you to share your know-how in capsule/summary form?
(assuming the folks managing comp.lang.python are interested)