From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Bob Proulx <bob@proulx.com>
Newsgroups: gmane.emacs.help
Subject: Re: Spam through the newsgroup gateway
Date: Sat, 10 Nov 2018 15:17:25 -0700
Message-ID: <20181110085658333248079@bob.proulx.com>
References: <20181025130739756133190@bob.proulx.com>
	<87o9bfiasg.fsf@portable.galex-713.eu>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: blaine.gmane.org 1541888599 3363 195.159.176.226 (10 Nov 2018 22:23:19 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Sat, 10 Nov 2018 22:23:19 +0000 (UTC)
User-Agent: Mutt/1.10.1 (2018-07-13)
Cc: "Garreau, Alexandre" <galex-713@galex-713.eu>
To: help-gnu-emacs@gnu.org
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Sat Nov 10 23:23:15 2018
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1gLbet-0000l8-1i
	for geh-help-gnu-emacs@m.gmane.org; Sat, 10 Nov 2018 23:23:15 +0100
Original-Received: from localhost ([::1]:40160 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1gLbgz-0006DK-0j
	for geh-help-gnu-emacs@m.gmane.org; Sat, 10 Nov 2018 17:25:25 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:42439)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <bob@proulx.com>) id 1gLbef-0002Yi-NS
	for help-gnu-emacs@gnu.org; Sat, 10 Nov 2018 17:23:04 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <bob@proulx.com>) id 1gLbZH-0004nw-Ej
	for help-gnu-emacs@gnu.org; Sat, 10 Nov 2018 17:17:31 -0500
Original-Received: from havoc.proulx.com ([96.88.95.61]:47281)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <bob@proulx.com>) id 1gLbZH-0004mX-5v
	for help-gnu-emacs@gnu.org; Sat, 10 Nov 2018 17:17:27 -0500
Original-Received: from joseki.proulx.com (localhost [127.0.0.1])
	by havoc.proulx.com (Postfix) with ESMTP id 867C316FF;
	Sat, 10 Nov 2018 15:17:25 -0700 (MST)
Original-Received: from hysteria.proulx.com (hysteria.proulx.com [192.168.230.119])
	by joseki.proulx.com (Postfix) with ESMTP id 41E3F217ED;
	Sat, 10 Nov 2018 15:17:25 -0700 (MST)
Original-Received: by hysteria.proulx.com (Postfix, from userid 1000)
	id 268FA2DC75; Sat, 10 Nov 2018 15:17:25 -0700 (MST)
Mail-Followup-To: help-gnu-emacs@gnu.org,
	"Garreau, Alexandre" <galex-713@galex-713.eu>
Content-Disposition: inline
In-Reply-To: <87o9bfiasg.fsf@portable.galex-713.eu>
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
	[fuzzy]
X-Received-From: 96.88.95.61
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/help-gnu-emacs/>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Original-Sender: "help-gnu-emacs"
	<help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.help:118640
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/118640>

Alexandre Garreau wrote:
> Bob Proulx wrote:
> > Public Service Announcement: Please do not reply to spam.  If a valid
> > message is in reply to a spam message then it refers to it and in a
> > sense validates it.  To talk about spam please use an independent
> > thread so as not to validate the original spam.
>
> Why so?

The best anti-spam engines in practice are learning engines such as
Bayesian classifiers and others.  Spam characteristics change quickly
and the human senders keep trying to be sneakier than before.  We use
no fewer than three engines: SpamAssassin, Bogofilter, and CRM114.  By
far CRM114 is the best of those three.  But there are subtle
differences that keep me playing one off against the other, and
therefore I keep adding engines rather than removing them.

Since they are learning engines they must be trained in order to
learn.  The best training has been training on error: when an engine's
classification differs from the correct one, it is trained on that
message to correct it.

All messages are fed through the anti-spam classification engines
twice: once on the frontend, to classify the message and determine
whether it should be automatically discarded, and once again after the
message has gone through the mailing list, to train on any errors.
Since the mailing lists are relatively spam free (IMNHO), I assume
that any message through the mailing list is a desired message.  If
any of the learning engines thinks otherwise then that triggers
training to learn that message as non-spam.
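
As a rough illustration of training on error, here is a minimal Python
sketch.  ToyFilter is a made-up stand-in for a learning engine, not
SpamAssassin, Bogofilter, or CRM114, and its token counting is far
cruder than what those engines do.  The point is only the control
flow: learn a message only when the verdict disagrees with the known
label.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a learning engine (not a real library)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def classify(self, text):
        # Score by which class has seen more of the message's tokens.
        tokens = text.lower().split()
        spam_hits = sum(self.spam[t] for t in tokens)
        ham_hits = sum(self.ham[t] for t in tokens)
        return "spam" if spam_hits > ham_hits else "ham"

    def learn(self, text, label):
        counts = self.spam if label == "spam" else self.ham
        counts.update(text.lower().split())


def train_on_error(engine, text, true_label):
    """Train only when the engine's verdict disagrees with the known label.

    Returns True if training happened.
    """
    if engine.classify(text) == true_label:
        return False
    engine.learn(text, true_label)
    return True
```

In a second pass like the one described above, each message that came
through the list would be passed in with true_label set to "ham".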

SpamAssassin knows the structure of email: what is a header and what
is the body.  Bogofilter and CRM114 have no knowledge of email
structure.  They process the message as a raw file, looking at tokens
in the headers and the structure and dynamically learning each one as
an indicator or not.  For them this includes IP addresses, email
addresses, and everything else.  Any token is open to being latched
onto.
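
To show the difference between the two approaches, here is a small
Python sketch.  It is not how any of the three engines actually
tokenizes; the regular expression, the function names, and the sample
message are all assumptions made up for illustration.  One function
treats the message as a flat file, the other parses the email
structure first and looks only at the body.

```python
import re
from email import message_from_string

# Assumed token pattern: runs of letters, digits, and a few symbols.
TOKEN = re.compile(r"[A-Za-z0-9@._-]+")


def raw_tokens(raw_message):
    """Tokenize the whole message as a flat file, headers included."""
    return set(TOKEN.findall(raw_message))


def body_tokens(raw_message):
    """Parse the email structure first, then tokenize only the body."""
    msg = message_from_string(raw_message)
    return set(TOKEN.findall(msg.get_payload()))


# Made-up single-part message using reserved example addresses.
SAMPLE = (
    "From: someone@example.org\n"
    "X-Received-From: 192.0.2.7\n"
    "Subject: hello\n"
    "\n"
    "Buy now\n"
)
```

With the raw approach the sender address and the IP address become
tokens just like the words in the body, which is why anything in the
headers can end up as an indicator.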

Just recently, due to our conversations about the newsgroup gateway
here, I have modified this algorithm slightly.  I now look for the
newsgroup gateway header.  If a message entered through the newsgroup
gateway then I ignore it: there isn't anything I can do about it, so
training on it makes no sense.  Until recently I did train on
newsgroup messages too.
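
A check like that could be sketched in Python as follows.  The header
name here is a placeholder invented for the example, since the real
gateway header is not named in this message, and should_train is a
hypothetical helper, not part of any of the engines.

```python
from email import message_from_string

# Hypothetical header name; substitute whatever the gateway really adds.
GATEWAY_HEADER = "X-Example-News-Gateway"


def should_train(raw_message, gateway_header=GATEWAY_HEADER):
    """Return False for messages that carry the gateway marker header."""
    msg = message_from_string(raw_message)
    # Header lookup in the email package is case-insensitive.
    return msg.get(gateway_header) is None
```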

If someone replies to the spam then the email headers and structure of
the reply and, goodness forbid, any of the spam that they quote (top
posting on the entire spam is worst) come through the mailing list.
All of that may have been associated with spam, but now it will be
associated with non-spam.  Training the learning engines on it pulls
the database toward thinking that that type of message, spam though it
is, is desirable on the mailing list, and they will pass it through in
the future.  It will eventually correct itself but that may take a
while, around a month for the size of the token database we keep.
From week to week the trend in spam changes.
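
The pull on the database can be seen with a toy calculation in Python.
This is a deliberately simplified per-token ratio, not the statistics
any of the real engines use, and the counts are invented for the
example.

```python
from collections import Counter


def spam_probability(token, spam_counts, ham_counts):
    """Fraction of a token's sightings that were in spam (0.5 if unseen)."""
    s, h = spam_counts[token], ham_counts[token]
    return 0.5 if s + h == 0 else s / (s + h)


# Invented starting counts: both tokens are seen mostly in spam.
spam_counts = Counter({"miracle": 9, "pills": 9})
ham_counts = Counter({"miracle": 1, "pills": 1})

before = spam_probability("pills", spam_counts, ham_counts)  # 9/10

# Three replies quoting the spam arrive via the list and are learned
# as non-spam, so the quoted tokens gain ham sightings.
for _ in range(3):
    ham_counts.update(["miracle", "pills"])

after = spam_probability("pills", spam_counts, ham_counts)  # 9/13
```

Each quoted reply nudges the token a little further toward non-spam,
which is the incremental drift described above.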

> If not sending anything to whoever sent the mail, will they
> track the mailing-list or its archive to find some other mail referring
> to it, and take this as an encouragement and post more spam?

Not likely.  I think for spammers it is mostly send and forget (like a
"fire and forget" military missile).

> Otherwise, what's the problem of validation if it's for a single spam?
> Let's say someone got their antispam block that spam: it seems to me
> normal, whenever a discussion is being about some spam that has been
> relayed by the list, that the user either see the aforementioned spam,
> to aknowledge the problem other are living (and get a sample of it), or
> not to see the thread at all, as they're not concerned.

If it is a single spam it isn't the end of the world.  It is all just
incremental, because each such message will be used to train the
learning engines.  They will recover given enough time and good later
input.  But every little bit counts!

> Ideally there should be a way to trigger metadata so that when you
> answer to something you do while marking it as spam for people seeing
> your message, like a mail header for it.

There are systems in use where the community can vote upon messages.
They usually require multiple votes, say five, from known quality
voters, and then the message is hidden.  But mostly we see those on
web page forums.  Since this is a mailing list, in order to install
such a thing we would need to have users trained on how to do this.
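
The voting rule itself is simple.  A minimal Python sketch, assuming a
known set of trusted voters and a threshold of five (both the function
name and the data shapes are made up for illustration):

```python
def is_hidden(votes, trusted_voters, threshold=5):
    """Hide a message once enough distinct trusted voters flag it.

    votes is an iterable of voter names; repeat votes and votes from
    unknown voters do not count.
    """
    distinct = set(votes) & set(trusted_voters)
    return len(distinct) >= threshold
```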

As another data point in this area, the Debian mailing lists have an
address to which people can "bounce" spam for further training of
their anti-spam learning engines.  It also serves as a notification to
the listmaster that a new type of spam is slipping through and needs
help to be blocked (they use procmail rules, we do too).  (Mutt has a
'b'ounce mail action; other mailers may or may not.)  We could set up
something like that but one does not exist at the moment.  With some
more work it could be useful, if people were to send it the spams that
slip through into the mailing list.

Sorry for the long delay in answering this message.  Life and time are
what keep everything from happening all at once.

Bob