From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Rusi <rustompmody@gmail.com>
Newsgroups: gmane.emacs.help
Subject: Re: Spam through the newsgroup gateway
Date: Thu, 15 Nov 2018 19:38:19 -0800 (PST)
Message-ID: <d5fd832c-630f-4693-af70-29dbd9f0f29b@googlegroups.com>
References: <20181025130739756133190@bob.proulx.com>
	<87o9bfiasg.fsf@portable.galex-713.eu>
	<mailman.3846.1541888586.1284.help-gnu-emacs@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Trace: blaine.gmane.org 1542339512 1914 195.159.176.226 (16 Nov 2018 03:38:32 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Fri, 16 Nov 2018 03:38:32 +0000 (UTC)
Injection-Date: Fri, 16 Nov 2018 03:38:20 +0000
User-Agent: G2/1.0
To: help-gnu-emacs@gnu.org
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Fri Nov 16 04:38:28 2018
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1gNUxf-0000OD-2X
	for geh-help-gnu-emacs@m.gmane.org; Fri, 16 Nov 2018 04:38:27 +0100
Original-Received: from localhost ([::1]:42009 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1gNUzl-0007qd-6E
	for geh-help-gnu-emacs@m.gmane.org; Thu, 15 Nov 2018 22:40:37 -0500
X-Received: by 2002:ac8:3870:: with SMTP id r45mr2157843qtb.1.1542339500720;
	Thu, 15 Nov 2018 19:38:20 -0800 (PST)
X-Received: by 2002:a0c:8b4c:: with SMTP id d12mr34169qvc.3.1542339499904;
	Thu, 15 Nov 2018 19:38:19 -0800 (PST)
Original-Path: usenet.stanford.edu!g188-v6no18009itg.0!news-out.google.com!m21ni1207qta.0!nntp.google.com!z5-v6no18738ite.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
Original-Newsgroups: gnu.emacs.help
In-Reply-To: <mailman.3846.1541888586.1284.help-gnu-emacs@gnu.org>
Complaints-To: groups-abuse@google.com
Original-Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=117.222.3.31;
	posting-account=mBpa7woAAAAGLEWUUKpmbxm-Quu5D8ui
Original-NNTP-Posting-Host: 117.222.3.31
Original-Xref: usenet.stanford.edu gnu.emacs.help:224552
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/help-gnu-emacs/>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Original-Sender: "help-gnu-emacs"
	<help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.help:118681
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/118681>

On Sunday, November 11, 2018 at 3:53:08 AM UTC+5:30, Bob Proulx wrote:
> Alexandre Garreau wrote:
> > Bob Proulx wrote:
> > > Public Service Announcement: Please do not reply to spam.  If a valid
> > > message is in reply to a spam message then it refers to it and in a
> > > sense validates it.  To talk about spam please use an independent
> > > thread so as not to validate the original spam.
> >
> > Why so?
> 
> The best anti-spam engines in practice are learning engines such as
> Bayesian filters and others.  Spam characteristics change so quickly and their
> human senders keep trying to be more sneaky than before.  We use no
> fewer than three!  SpamAssassin, Bogofilter, and CRM114.  By far
> CRM114 is the best of those three.  But there are subtle differences
> that keep me playing one off the other and therefore continuing to add
> engines rather than remove them.
> 
> Since they are learning engines they must be trained in order to
> learn.  The best training has been training on error.  When the
> classification is different it must be corrected.
> 
> All messages are fed through the anti-spam classification engines
> twice.  Once on the frontend in order to classify the message to
> determine if it should be automatically discarded.  And then once
> again after the messages go through the mailing list to train on any
> errors.  Since the mailing lists are relatively spam free (IMNHO) then
> I assume that any message through the mailing list is a desired
> message.  If any of the learning engines think otherwise then it
> triggers training to learn that message as non-spam.
> 
> SpamAssassin knows the structure of email, what's a header and what is
> the body.  Bogofilter and CRM114 have no knowledge of email structure
> and process the message as a raw file looking at tokens in the headers
> and structure and learning them as either indicators or not
> dynamically.  For them this includes IP addresses and email addresses
> and everything.  Everything is open to gripping upon.
> 
> Just recently, due to our conversations about the newsgroup gateway
> here, I have modified this algorithm slightly.  I now look for the
> newsgroup gateway header.  If a message entered through the newsgroup
> then I ignore it.  There isn't anything I can do about it.  Training
> on it makes no sense.  Therefore I ignore it.  No training.  But until
> recently I did train on newsgroup messages too.
> 
> If someone replies to the message then the email headers and the
> structure of it and, goodness forbid if they quote any of the message
> (top posting on the entire spam is worst), then all of that may have
> been associated with spam but when it comes through the mailing list
> now it will be associated with non-spam.  Training the learning
> engines on it will pull the database to thinking that that type of
> message, spam though it is, is desirable on the mailing list and will
> pass it through in the future.  It will eventually correct but may
> take a while.  A while being around a month for the size of the token
> database we keep.  From week to week the trend in spam changes.
> 
> > If not sending anything to whoever sent the mail, will they
> > track the mailing-list or its archive to find some other mail referring
> > to it, and take this as an encouragement and post more spam?
> 
> Not likely.  I think for spammers it is mostly send and forget (like a
> "fire and forget" military missile).
> 
> > Otherwise, what's the problem of validation if it's for a single spam?
> > Let's say someone got their antispam block that spam: it seems to me
> > normal, whenever a discussion is being about some spam that has been
> > relayed by the list, that the user either see the aforementioned spam,
> > to acknowledge the problem others are living with (and get a sample of it), or
> > not to see the thread at all, as they're not concerned.
> 
> If it is a single spam it isn't the end of the world.  It is all just
> incremental.  Because it will be used to train the learning engines.
> And they will recover given enough time and good later input.  But
> every little bit counts!
> 
> > Ideally there should be a way to trigger metadata so that when you
> > answer to something you do while marking it as spam for people seeing
> > your message, like a mail header for it.
> 
> There are systems in use where the community can vote upon messages.
> They usually require multiple votes, say five, from known quality
> voters, and then the message is hidden.  But mostly we see those with
> web page forums.  Since this is a mailing list in order to install
> such a thing we would need to have users trained on how to do this.
> 
> As another data point in this area the Debian mailing lists have an
> address where people can "bounce" the spam to for further training of
> their anti-spam learning engines.  And as a notification to the
> listmaster that spam is flowing in and needs help to be blocked (they
> use procmail rules, we do too) if they get a new type that slips
> through.  (Mutt has a 'b'ounce mail action, other mailers may or may
> not.)  We could set up something like that but one does not exist at
> the moment.  With some more work it could be useful if people were to
> contribute spams that slip through into the mailing list to it.
> 
> Sorry for the long delay in answering this message.  Life and time is
> what keeps everything from happening all at once.
> 
> Bob

You seem to be doing a splendid job of managing ML-news gateway spam
[ Compare https://groups.google.com/forum/#!forum/comp.lang.python ]

I wonder how easy it would be for you to share your know-how in capsule/summary form?
(assuming the folks managing comp.lang.python are interested)
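
To make the request concrete, here is how I read the post-list "train on
error" pass described above, as a rough Python sketch. It is not Bob's actual
setup: ToyEngine is a made-up stand-in for SpamAssassin/Bogofilter/CRM114,
and the header name in GATEWAY_HEADER is my own placeholder, since the
message above does not say which header marks gateway traffic.

```python
from email import message_from_string

# Assumption: the thread never names the gateway header; this is a placeholder.
GATEWAY_HEADER = "Newsgroups"


class ToyEngine:
    """Hypothetical stand-in for a learning filter; remembers ham tokens."""

    def __init__(self, name):
        self.name = name
        self.ham_tokens = set()

    def is_spam(self, raw):
        # Toy rule: spam unless the raw text shares a token with learned ham.
        return not (set(raw.split()) & self.ham_tokens)

    def learn_ham(self, raw):
        # Like the raw-file engines described above, tokenize headers and body alike.
        self.ham_tokens.update(raw.split())


def train_on_error(raw, engines):
    """Post-list pass: return the names of the engines that were corrected."""
    msg = message_from_string(raw)
    if msg[GATEWAY_HEADER] is not None:
        return []  # entered through the newsgroup gateway: ignore, no training
    corrected = []
    for engine in engines:
        # Anything that made it through the list is assumed to be wanted,
        # so an engine that says "spam" here has made an error.
        if engine.is_spam(raw):
            engine.learn_ham(raw)
            corrected.append(engine.name)
    return corrected
```

Engines that already classify a list message as wanted are left alone, which
is the "training on error" part; gateway messages never reach the training
step at all.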