From: Bob Proulx
Newsgroups: gmane.emacs.help
Subject: Re: Spam through the newsgroup gateway
Date: Sat, 10 Nov 2018 15:17:25 -0700
Message-ID: <20181110085658333248079@bob.proulx.com>
In-Reply-To: <87o9bfiasg.fsf@portable.galex-713.eu>
To: help-gnu-emacs@gnu.org
Cc: "Garreau, Alexandre"

Alexandre Garreau wrote:
> Bob Proulx wrote:
> > Public Service Announcement: Please do not reply to spam. If a valid
> > message is in reply to a spam message then it refers to it and in a
> > sense validates it. To talk about spam please use an independent
> > thread so as not to validate the original spam.
>
> Why so?

The best anti-spam engines in practice are learning engines such as
Bayesian filters and others. Spam characteristics change quickly, and
the humans sending it keep trying to be sneakier than before. We use
no fewer than three: SpamAssassin, Bogofilter, and CRM114. By far
CRM114 is the best of those three. But there are subtle differences
that keep me playing one off the other, so I keep adding engines
rather than removing them.

Since they are learning engines they must be trained in order to
learn. The best training has been training on error: when an engine's
classification of a message is wrong, it must be corrected.
All messages are fed through the anti-spam classification engines
twice: once on the frontend, to classify the message and determine
whether it should be automatically discarded, and once again after the
message has gone through the mailing list, to train on any errors.
Since the mailing lists are relatively spam free (IMNHO) I assume that
any message through the mailing list is a desired message. If any of
the learning engines think otherwise then it triggers training to
learn that message as non-spam.

SpamAssassin knows the structure of email: what is a header and what
is the body. Bogofilter and CRM114 have no knowledge of email
structure. They process the message as a raw file, looking at tokens
in the headers and structure and dynamically learning them as
indicators or not. For them this includes IP addresses and email
addresses and everything. Anything is open for them to grip onto.

Just recently, due to our conversations about the newsgroup gateway
here, I have modified this algorithm slightly. I now look for the
newsgroup gateway header. If a message entered through the newsgroup
gateway then I ignore it. There isn't anything I can do about it, so
training on it makes no sense. But until recently I did train on
newsgroup messages too.

If someone replies to the spam then the email headers, the structure
of the message and, goodness forbid, any part of it they quote (top
posting over the entire spam is worst) may all have been associated
with spam. But when the reply comes through the mailing list all of
that will now be associated with non-spam. Training the learning
engines on it pulls the database toward thinking that that type of
message, spam though it is, is desirable on the mailing list, and the
engines will pass it through in the future. It will eventually correct
but may take a while, a while being around a month for the size of the
token database we keep. From week to week the trend in spam changes.
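
To make the second pass concrete, here is a minimal sketch in Python.
It is an illustration only, not the scripts actually in use. The
`Engine` class is a hypothetical toy stand-in for a learning engine
(the real Bogofilter and CRM114 are far more sophisticated), and
`Newsgroups` is an assumed header name, since the message above does
not say which header marks gateway traffic.

```python
from email import message_from_string

# Assumption: the post does not name the gateway header; "Newsgroups"
# is only a plausible stand-in.
GATEWAY_HEADER = "Newsgroups"


class Engine:
    """Hypothetical toy engine: a message is spam if it contains any
    token the engine currently associates with spam."""

    def __init__(self, name, spam_tokens):
        self.name = name
        self.spam_tokens = set(spam_tokens)

    def is_spam(self, raw):
        # Like the raw-file engines, tokenize headers and body alike.
        return any(tok in self.spam_tokens for tok in raw.lower().split())

    def learn_ham(self, raw):
        # Learning a message as non-spam un-marks every token in it.
        self.spam_tokens -= set(raw.lower().split())


def train_on_error(raw, engines):
    """Second pass: return the names of engines corrected to non-spam."""
    msg = message_from_string(raw)
    if msg[GATEWAY_HEADER] is not None:
        return []  # came in through the newsgroup gateway: no training
    corrected = []
    for engine in engines:
        if engine.is_spam(raw):     # disagrees with "list mail is wanted"
            engine.learn_ham(raw)   # train on the error
            corrected.append(engine.name)
    return corrected
```

In this toy model a list reply that quotes spam makes the engine
forget the quoted spam tokens, which is the drift described above.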

> If not sending anything to whoever sent the mail, will they
> track the mailing-list or its archive to find some other mail referring
> to it, and take this as an encouragement and post more spam?

Not likely. I think for spammers it is mostly send and forget (like a
"fire and forget" military missile).

> Otherwise, what's the problem of validation if it's for a single spam?
> Let's say someone got their antispam block that spam: it seems to me
> normal, whenever a discussion is being about some spam that has been
> relayed by the list, that the user either see the aforementioned spam,
> to aknowledge the problem other are living (and get a sample of it), or
> not to see the thread at all, as they're not concerned.

If it is a single spam it isn't the end of the world. It is all just
incremental, because the reply will be used to train the learning
engines. They will recover given enough time and good later input.
But every little bit counts!

> Ideally there should be a way to trigger metadata so that when you
> answer to something you do while marking it as spam for people seeing
> your message, like a mail header for it.

There are systems in use where the community can vote upon messages.
They usually require multiple votes, say five, from known quality
voters, and then the message is hidden. But mostly we see those with
web page forums. Since this is a mailing list, in order to install
such a thing we would need to have users trained on how to use it.

As another data point in this area, the Debian mailing lists have an
address to which people can "bounce" spam for further training of
their anti-spam learning engines. It also serves as a notification to
the listmaster that spam is flowing in and needs help to be blocked
(they use procmail rules, we do too) when a new type slips through.
(Mutt has a 'b'ounce mail action, other mailers may or may not.) We
could set up something like that but one does not exist at the moment.
With some more work it could be useful, if people were to bounce to it
the spam that slips through onto the mailing list.

Sorry for the long delay in answering this message. Life and time is
what keeps everything from happening all at once.

Bob