From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <notmuch-bounces@notmuchmail.org>
Received: from mp1 ([2001:41d0:2:4a6f::])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	by ms11 with LMTPS
	id UO5ZKwsXyF8wLgAA0tVLHw
	(envelope-from <notmuch-bounces@notmuchmail.org>)
	for <larch@yhetil.org>; Wed, 02 Dec 2020 22:36:59 +0000
Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	by mp1 with LMTPS
	id eKoaJwsXyF/SWQAAbx9fmQ
	(envelope-from <notmuch-bounces@notmuchmail.org>)
	for <larch@yhetil.org>; Wed, 02 Dec 2020 22:36:59 +0000
Received: from mail.notmuchmail.org (nmbug.tethera.net [144.217.243.247])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 server-signature RSA-PSS (2048 bits))
	(No client certificate requested)
	by aspmx1.migadu.com (Postfix) with ESMTPS id 30D869403A8
	for <larch@yhetil.org>; Wed,  2 Dec 2020 22:36:58 +0000 (UTC)
Received: from nmbug.tethera.net (localhost [127.0.0.1])
	by mail.notmuchmail.org (Postfix) with ESMTP id 7346928C87;
	Wed,  2 Dec 2020 17:36:49 -0500 (EST)
X-Greylist: delayed 400 seconds by postgrey-1.36 at nmbug; Wed, 02 Dec 2020 17:36:46 EST
Received: from avior.uberspace.de (avior.uberspace.de [185.26.156.32])
	by mail.notmuchmail.org (Postfix) with ESMTPS id 6A4BF28C5B
	for <notmuch@notmuchmail.org>; Wed,  2 Dec 2020 17:36:46 -0500 (EST)
Received: (qmail 19690 invoked from network); 2 Dec 2020 22:30:03 -0000
Received: from localhost (HELO europa) (127.0.0.1)
  by avior.uberspace.de with SMTP; 2 Dec 2020 22:30:03 -0000
Received: from localhost.lan ([127.0.0.1] helo=localhost)
	by europa with esmtp (Exim 4.94)
	(envelope-from <justus@sequoia-pgp.org>)
	id 1kkadO-00EW1k-1C
	for notmuch@notmuchmail.org; Wed, 02 Dec 2020 23:30:02 +0100
From: Justus Winter <justus@sequoia-pgp.org>
To: notmuch@notmuchmail.org
Subject: Machine-learning-based tagging solution prototype
Date: Wed, 02 Dec 2020 23:30:01 +0100
Message-ID: <87h7p396za.fsf@europa.jade-hamburg.de>
MIME-Version: 1.0
Message-ID-Hash: 7JU6GIN365UP42XN5J7YNJSYSCLDU5SW
X-Message-ID-Hash: 7JU6GIN365UP42XN5J7YNJSYSCLDU5SW
X-MailFrom: justus@sequoia-pgp.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; header-match-notmuch.notmuchmail.org-0; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; suspicious-header
X-Mailman-Version: 3.2.1
Precedence: list
List-Id: "Use and development of the notmuch mail system." <notmuch.notmuchmail.org>
List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
List-Post: <mailto:notmuch@notmuchmail.org>
List-Subscribe: <mailto:notmuch-join@notmuchmail.org>
List-Unsubscribe: <mailto:notmuch-leave@notmuchmail.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Migadu-Flow: FLOW_IN
X-Migadu-Spam-Score: -2.12
Authentication-Results: aspmx1.migadu.com;
	dkim=none;
	dmarc=none;
	spf=pass (aspmx1.migadu.com: domain of notmuch-bounces@notmuchmail.org designates 144.217.243.247 as permitted sender) smtp.mailfrom=notmuch-bounces@notmuchmail.org
X-Migadu-Queue-Id: 30D869403A8
X-Spam-Score: -2.12
X-Migadu-Scanner: ns3122888.ip-94-23-21.eu
X-TUID: xBSmHUqykvQ3

'sup :)

tl;dr: Please meet blaecksprutte[0], a machine-learning-based tagging
solution for notmuch.  This is a prototype, but I've been using it as
my sole tagging solution for three years now.

0: https://github.com/teythoon/blaecksprutte

Long version: When I started using notmuch, writing custom tagging
scripts was part of the setup process.  It seems that everyone had to
reinvent the wheel, so I started working on afew.  afew worked fine,
but it had to be fed rules on how to tag mails, a tedious process.  I
always wanted a smarter tagging solution, one that could learn how to
label my mails, like a human secretary could.  I tried to implement
that in afew, but my approach was naive, and never worked really well.

(For the curious, it was based on dbacl: http://dbacl.sourceforge.net/)

But the idea kept rattling around in my head, and three years ago I
convinced someone who is smarter than me and actually understands
something about machine learning to implement it.  And it works very
well for me.  Not that it doesn't make mistakes, but to err is human,
after all, and I don't have to worry about writing and updating rules.
For me, it is a worthwhile trade-off.

The algorithm is roughly: Bag of words, then multi-class, multi-label
classification using a support vector machine.  All the heavy lifting
is done by scikit-learn.  I may very well be getting things wrong.
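
A minimal sketch of the pipeline described above, assuming scikit-learn
is installed.  This is not the blaecksprutte code: the toy corpus, the
tag names, the suggest_tags helper and the choice of LinearSVC wrapped
in OneVsRestClassifier are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus (made up): message text plus the tags that mail carries.
mails = [
    ("patch series for the notmuch emacs frontend", ["notmuch", "patch"]),
    ("notmuch release announcement", ["notmuch"]),
    ("invoice for your hosting account", ["billing"]),
    ("payment reminder: hosting invoice overdue", ["billing"]),
    ("patch: fix typo in hosting docs", ["patch"]),
    ("notmuch patch review: tagging fixes", ["notmuch", "patch"]),
]
texts = [text for text, _ in mails]
tags = [mail_tags for _, mail_tags in mails]

# Multi-label targets: one 0/1 column per tag.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# Bag of words (word counts), then one linear SVM per tag.
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)


def suggest_tags(text):
    """Return the tags predicted for one message text (hypothetical helper)."""
    predicted = model.predict([text])
    return list(binarizer.inverse_transform(predicted)[0])


print(suggest_tags("new invoice for hosting"))
```

Training one binary classifier per tag is what allows a mail to end up
with several tags at once, or with none.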

Here is what it looks like.  You can try it yourself: until you run
the tag action, it doesn't modify your database.

  % python3 blaecksprutte.py -h
  usage: blaecksprutte.py [-h] [--verbose] [--progress] {train,tag,validate} ...

  positional arguments:
    {train,tag,validate}
      train               train the tagger from standard notmuch database
      tag                 tag the mails with a new-tag
      validate            show a classification report on stdout when trained on 0.6 of the maildir and tested on the other 0.4.

  optional arguments:
    -h, --help            show this help message and exit
    --verbose             print logging messages to stdout
    --progress            print a progress bar
  % python3 blaecksprutte.py --progress train
  3%|###                                               |ETA:  0:23:44
  [... later that day...]
  % python3 blaecksprutte.py --progress validate
  100%|################################################|Time: 0:15:07
  /usr/lib/python3/dist-packages/sklearn/multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
    warnings.warn("Label %s is present in all training examples." %
  [... loads of those warnings omitted...]
                   precision    recall  f1-score   support

             afew       0.00      0.00      0.00         0
  [... loads of more tags omitted for privacy...]

(The astute reader may notice that a precision and recall of 0.00 is
not great.  It used to display more plausible values there.  Maybe
it's got to do with those pesky warnings.  Someone(TM) better look
into that.  I hope that tagging still works...)
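
One possible reading of the report above, sketched with scikit-learn
only: a support of 0 means that no mail in the test split carries the
tag, and in that case precision and recall for that tag cannot be
anything but 0.00 (or undefined).  The indicator matrices and tag names
below are made up, and whether this is what happens in blaecksprutte
itself is an assumption.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up indicator matrices: rows are test mails, columns are tags.
tag_names = ["afew", "inbox"]
y_true = np.array([[0, 1], [0, 0], [0, 1], [0, 0], [0, 1]])
y_pred = np.array([[0, 1], [0, 0], [0, 0], [0, 0], [0, 1]])

# No test mail carries "afew", so its support is 0 and its scores
# fall back to the zero_division value.
report = classification_report(
    y_true, y_pred, target_names=tag_names, output_dict=True, zero_division=0
)

print(classification_report(
    y_true, y_pred, target_names=tag_names, zero_division=0
))
```

A tag that never occurs in the training split gives a one-class
training problem, which appears to be what the "is present in all
training examples" warnings are about.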

The code is at [0].  This is actually my fork: I took the most usable
state of the prototype, tweaked it over the years, and ported it to
Python 3.  If you want to use it or experiment with it, I suggest
starting from this point.

This is a code dump.  This mail is the best documentation of the tool;
the code needs some love, and there are problems and quirks.  It is
far from a polished project.  I feel bad for throwing it over the
fence like that, but not publishing it is also unsatisfactory.

I'd love to see someone pick it up, improve it, polish it, love it.
I'm happy to help with that if I can, of course, mostly by
braindumping.  If there was a well-maintained version of it, I'd
switch to it in a minute.

Maybe it can be developed under the umbrella of the afew project, even
integrated into the main project.

-- Shoutout to the afew developers!  You taking the stagnating afew off
-- my shoulders and caring for it was the best thing that ever happened
-- to a project of mine.  You are the best <3

Happy hacking,
Justus