unofficial mirror of notmuch@notmuchmail.org
* Machine-learning-based tagging solution prototype
From: Justus Winter @ 2020-12-02 22:30 UTC
  To: notmuch

'sup :)

tl;dr: Please meet blaecksprutte[0], a machine-learning-based tagging
solution for notmuch.  This is a prototype, but I've been using it as
my sole tagging solution for three years now.

0: https://github.com/teythoon/blaecksprutte

Long version: When I started using notmuch, writing custom tagging
scripts was part of the setup process.  It seems that everyone had to
reinvent the wheel, so I started working on afew.  afew worked fine,
but it had to be fed rules on how to tag mails, a tedious process.  I
always wanted a smarter tagging solution, one that could learn how to
label my mails, like a human secretary could.  I tried to implement
that in afew, but my approach was naive, and never worked really well.

(For the curious, it was based on dbacl: http://dbacl.sourceforge.net/)

But the idea kept rattling around in my head, and three years ago I
convinced someone who is smarter than me and actually understood
something about machine learning to implement it.  And it works very
well for me.  Not that it doesn't make mistakes, but to err is human,
after all, and I don't have to worry about writing and updating rules.
For me, it is a worthwhile trade-off.

The algorithm is roughly: Bag of words, then multi-class, multi-label
classification using a support vector machine.  All the heavy lifting
is done by scikit-learn.  I may very well be getting things wrong.
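
For illustration, the pipeline described above could be sketched with
scikit-learn roughly as follows.  This is not the code from
blaecksprutte: the toy mails, the tag names, the helper predict_tags,
and the choice of CountVectorizer, MultiLabelBinarizer,
OneVsRestClassifier and LinearSVC are all assumptions made for this
example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Made-up training data: (mail text, tags already assigned by hand).
mails = [
    ("patch review for the notmuch emacs frontend", {"notmuch", "patch"}),
    ("notmuch release announcement", {"notmuch"}),
    ("invoice for your hosting account", {"bills"}),
    ("your monthly hosting invoice is due", {"bills"}),
    ("patch: fix build failure in the parser", {"patch"}),
    ("reminder: payment due for invoice", {"bills"}),
]
texts = [text for text, _ in mails]
tag_sets = [sorted(tags) for _, tags in mails]

# Multi-label targets: one 0/1 column per tag, one row per mail.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tag_sets)

# Bag of words (word counts) feeding one linear SVM per tag.
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)


def predict_tags(text):
    """Return the set of tags the model assigns to a single mail text."""
    indicator_row = model.predict([text])
    return set(binarizer.inverse_transform(indicator_row)[0])


print(predict_tags("hosting invoice payment due"))
```

Training one binary classifier per tag is what lets a single mail end
up with several tags at once, or with none at all.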

Here is what it looks like.  You can safely try it yourself, because it
doesn't modify your database until you execute the tag action:

  % python3 blaecksprutte.py -h
  usage: blaecksprutte.py [-h] [--verbose] [--progress] {train,tag,validate} ...

  positional arguments:
    {train,tag,validate}
      train               train the tagger from standard notmuch database
      tag                 tag the mails with a new-tag
      validate            show a classification report on stdout when trained on 0.6 of the maildir and tested on the other 0.4.

  optional arguments:
    -h, --help            show this help message and exit
    --verbose             print logging messages to stdout
    --progress            print a progress bar
  % python3 blaecksprutte.py --progress train
  3%|###                                               |ETA:  0:23:44
  [... later that day...]
  % python3 blaecksprutte.py --progress validate
  100%|################################################|Time: 0:15:07
  /usr/lib/python3/dist-packages/sklearn/multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
    warnings.warn("Label %s is present in all training examples." %
  [... loads of those warnings omitted...]
                   precision    recall  f1-score   support

             afew       0.00      0.00      0.00         0
  [... loads of more tags omitted for privacy...]

(The astute reader may notice that a precision and recall of 0.00 is
not great.  It used to display more plausible values there.  Maybe
it's got to do with those pesky warnings.  Someone(TM) better look
into that.  I hope that tagging still works...)
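
One possible reading of that row, not checked against the actual data:
in scikit-learn's classification_report, "support" counts the test
samples that really carry a label.  A tag that never occurs in the
held-out part and is never predicted gets a support of 0, and its
scores are reported as 0.  The sketch below reproduces that with
hand-written label matrices.  The tag names and values are made up,
and this may not be how blaecksprutte calls the function.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical held-out mails: rows are mails, columns are tags.
tag_names = ["afew", "inbox", "lists"]
y_true = np.array([
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
])
# Hypothetical predictions for the same mails.
y_pred = np.array([
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 1, 0],
])

# zero_division=0 reports undefined scores as 0 without a warning.
report = classification_report(
    y_true, y_pred, target_names=tag_names, zero_division=0, output_dict=True
)

for name in tag_names:
    row = report[name]
    print(name, row["precision"], row["recall"], row["f1-score"], row["support"])
```

If that is what happened here, a 0.00 row with support 0 says nothing
about how well the tag is learned, only that the test split had no
examples of it.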

The code is at [0].  This is actually my fork: I took the most usable
state of the prototype, tweaked it over the years, and ported it to
Python 3.  If you want to use it or experiment with it, I suggest
starting from this point.

This is a code dump.  This mail is the best documentation of the tool;
the code needs some love, and there are problems and quirks.  It is far
from a polished project.  I feel bad for throwing it over the fence
like that, but not publishing it is also unsatisfactory.

I'd love to see someone pick it up, improve it, polish it, love it.
I'm happy to help with that if I can, of course, mostly by
braindumping.  If there was a well-maintained version of it, I'd
switch to it in a minute.

Maybe it can be developed under the umbrella of the afew project, even
integrated into the main project.

-- Shoutout to the afew developers!  You taking the stagnating afew off
-- my shoulders and caring for it was the best thing that ever happened
-- to a project of mine.  You are the best <3

Happy hacking,
Justus
