From: Rusi
Newsgroups: gmane.emacs.help
Subject: Re: Spam through the newsgroup gateway
Date: Thu, 15 Nov 2018 19:38:19 -0800 (PST)
References: <20181025130739756133190@bob.proulx.com> <87o9bfiasg.fsf@portable.galex-713.eu>
To: help-gnu-emacs@gnu.org

On Sunday, November 11, 2018 at 3:53:08 AM UTC+5:30, Bob Proulx wrote:
> Alexandre Garreau wrote:
> > Bob Proulx wrote:
> > > Public Service Announcement: Please do not reply to spam. If a valid
> > > message is in reply to a spam message then it refers to it and in a
> > > sense validates it. To talk about spam please use an independent
> > > thread so as not to validate the original spam.
> >
> > Why so?
>
> The best anti-spam engines in practice are learning engines such as
> Bayes and other. Spam characteristics change so quickly and their
> human senders keep trying to be more sneaky than before. We use no
> fewer than three! SpamAssassin, Bogofilter, and CRM114. By far
> CRM114 is the best of those three. But there are subtle differences
> that keep me playing one off the other and therefore continuing to add
> engines rather than remove them.
>
> Since they are learning engines they must be trained in order to
> learn. The best training has been training on error. When the
> classification is different it must be corrected.
>
> All messages are fed through the anti-spam classification engines
> twice. Once on the frontend in order to classify the message to
> determine if it should be automatically discarded. And then once
> again after the messages go through the mailing list to train on any
> errors. Since the mailing lists are relatively spam free (IMNHO) then
> I assume that any message through the mailing list is a desired
> message. If any of the learning engines think otherwise then it
> triggers training to learn that message as non-spam.
>
> SpamAssassin knows the structure of email, what's a header and what is
> the body. Bogofilter and CRM114 have no knowledge of email structure
> and process the message as a raw file looking at tokens in the headers
> and structure and learning them as either indicators or not
> dynamically. For them this includes IP addresses and email addresses
> and everything. Everything is open to gripping upon.
>
> Just recently, due to our conversations about the newsgroup gateway
> here, I have modified this algorithm slightly. I now look for the
> newsgroup gateway header. If a message entered through the newsgroup
> then I ignore it. There isn't anything I can do about it. Training
> on it makes no sense. Therefore I ignore it. No training. But until
> recently I did train on newsgroup messages too.
>
> If someone replies to the message then the email headers and the
> structure of it and, goodness forbid if they quote any of the message
> (top posting on the entire spam is worst), then all of that may have
> been associated with spam but when it comes through the mailing list
> now it will be associated with non-spam. Training the learning
> engines on it will pull the database to thinking that that type of
> message, spam though it is, is desirable on the mailing list and will
> pass it through in the future. It will eventually correct but may
> take a while. A while being around a month for the size of the token
> database we keep. From week to week the trend in spam changes.
>
> > If not sending anything to whoever sent the mail, will they
> > track the mailing-list or its archive to find some other mail referring
> > to it, and take this as an encouragement and post more spam?
>
> Not likely. I think for spammers it is mostly send and forget (like a
> "fire and forget" military missile).
>
> > Otherwise, what's the problem of validation if it's for a single spam?
> > Let's say someone got their antispam block that spam: it seems to me
> > normal, whenever a discussion is being about some spam that has been
> > relayed by the list, that the user either see the aforementioned spam,
> > to acknowledge the problem others are living (and get a sample of it), or
> > not to see the thread at all, as they're not concerned.
>
> If it is a single spam it isn't the end of the world. It is all just
> incremental. Because it will be used to train the learning engines.
> And they will recover given enough time and good later input. But
> every little bit counts!
>
> > Ideally there should be a way to trigger metadata so that when you
> > answer to something you do while marking it as spam for people seeing
> > your message, like a mail header for it.
>
> There are systems in use where the community can vote upon messages.
> They usually require multiple votes, say five, from known quality
> voters, and then the message is hidden. But mostly we see those with
> web page forums. Since this is a mailing list in order to install
> such a thing we would need to have users trained on how to do this.
>
> As another data point in this area the Debian mailing lists have an
> address where people can "bounce" the spam to for further training of
> their anti-spam learning engines. And as a notification to the
> listmaster that spam is flowing in and needs help to be blocked (they
> use procmail rules, we do too) if they get a new type that slips
> through. (Mutt has a 'b'ounce mail action, other mailers may or may
> not.) We could set up something like that but one does not exist at
> the moment. With some more work it could be useful if people were to
> contribute spams that slip through into the mailing list to it.
>
> Sorry for the long delay in answering this message. Life and time is
> what keeps everything from happening all at once.
>
> Bob

You seem to be managing a splendid job with ML-news gateway spam
[ Compare https://groups.google.com/forum/#!forum/comp.lang.python ]

Wonder how easy it would be for you to share your know-how in capsule/summary??
(assuming the folks managing comp.lang.python are interested)
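
For anyone reading along, here is a rough sketch of the second pass described in the quoted message: anything that reaches the list is presumed wanted, any engine that still calls it spam is retrained on it as non-spam, and messages that entered through the newsgroup gateway are skipped. This is only an illustration under assumptions, not Bob's actual setup. ToyTokenFilter is a made-up stand-in for a raw-token learner (it is not Bogofilter or CRM114), train_on_error is a hypothetical name, and the header used to detect gateway traffic is a guess, since the thread does not say which header is checked.

```python
from collections import Counter
from email import message_from_string

# Assumption: the thread does not name the gateway header that is checked.
GATEWAY_HEADER = "Newsgroups"


class ToyTokenFilter:
    """Made-up stand-in for a learning engine; not Bogofilter or CRM114."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def is_spam(self, raw):
        # Treat the message as a raw file: header text counts as tokens too.
        tokens = raw.split()
        spam_hits = sum(self.spam[t] for t in tokens)
        ham_hits = sum(self.ham[t] for t in tokens)
        return spam_hits > ham_hits

    def learn(self, raw, spam):
        (self.spam if spam else self.ham).update(raw.split())


def train_on_error(raw, engines):
    """Second pass: a message that came through the list is presumed wanted."""
    msg = message_from_string(raw)
    if msg[GATEWAY_HEADER] is not None:
        return "skipped"  # came in via the newsgroup gateway: no training
    retrained = 0
    for engine in engines:
        if engine.is_spam(raw):  # the engine disagrees with "wanted"
            engine.learn(raw, spam=False)
            retrained += 1
    return f"retrained {retrained}"


engine = ToyTokenFilter()
engine.learn("cheap pills now", spam=True)

# A list reply that quotes the spam gets its tokens relearned as non-spam.
reply = "From: a@example.org\nSubject: hi\n\ncheap pills now\n"
print(train_on_error(reply, [engine]))

# The same text arriving through the gateway is left alone.
gated = "Newsgroups: gnu.emacs.help\nSubject: hi\n\ncheap pills now\n"
print(train_on_error(gated, [engine]))
```

The first call also shows why replying to spam is a problem: the quoted spam tokens are now counted as non-spam, so the toy engine stops flagging that text.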