From mboxrd@z Thu Jan 1 00:00:00 1970
From: Justus Winter
To: notmuch@notmuchmail.org
Subject: Machine-learning-based tagging solution prototype
Date: Wed, 02 Dec 2020 23:30:01 +0100
Message-ID: <87h7p396za.fsf@europa.jade-hamburg.de>
MIME-Version: 1.0
List-Id: "Use and development of the notmuch mail system."
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

'sup :)

tl;dr: Please meet blaecksprutte[0], a machine-learning-based tagging
solution for notmuch.  This is a prototype, but I've been using it as
my sole tagging solution for three years now.

0: https://github.com/teythoon/blaecksprutte

Long version:

When I started using notmuch, writing custom tagging scripts was part
of the setup process.  It seems that everyone had to reinvent the
wheel, so I started working on afew.

afew worked fine, but it had to be fed rules on how to tag mails, a
tedious process.  I always wanted a smarter tagging solution, one that
could learn how to label my mails, like a human secretary could.  I
tried to implement that in afew, but my approach was naive, and never
worked really well.  (For the curious, it was based on dbacl:
http://dbacl.sourceforge.net/)

But the idea kept spooking around in my head, and three years ago I
convinced someone who is smarter than me and actually understood
something about machine learning to implement it.

And it works very well for me.  Not that it doesn't make mistakes, but
to err is human, after all, and I don't have to worry about writing
and updating rules.
For me, it is a worthwhile trade-off.

The algorithm is roughly: Bag of words, then multi-class, multi-label
classification using a support vector machine.  All the heavy lifting
is done by scikit-learn.  I may very well be getting things wrong.

Here is what it looks like.  You can try it yourself: until you
execute the tag action, it doesn't modify your database.

    % python3 blaecksprutte.py -h
    usage: blaecksprutte.py [-h] [--verbose] [--progress]
                            {train,tag,validate} ...

    positional arguments:
      {train,tag,validate}
        train               train the tagger from standard notmuch database
        tag                 tag the mails with a new-tag
        validate            show a classification report on stdout when
                            trained on 0.6 of the maildir and tested on the
                            other 0.4.

    optional arguments:
      -h, --help            show this help message and exit
      --verbose             print logging messages to stdout
      --progress            print a progress bar

    % python3 blaecksprutte.py --progress train
      3%|###                                             |ETA:  0:23:44
    [... later that day...]

    % python3 blaecksprutte.py --progress validate
    100%|################################################|Time: 0:15:07
    /usr/lib/python3/dist-packages/sklearn/multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
      warnings.warn("Label %s is present in all training examples." %
    [... loads of those warnings omitted...]
                  precision    recall  f1-score   support

            afew       0.00      0.00      0.00         0
    [... loads of more tags omitted for privacy...]

(The astute reader may notice that a precision and recall of 0.00 is
not great.  It used to display more plausible values there.  Maybe
it's got to do with those pesky warnings.  Someone(TM) better look
into that.  I hope that tagging still works...)

The code is at [0].  This is actually my fork: I took the most usable
state of the prototype, tweaked it over the years, and ported it to
Python 3.  If you want to use it or experiment with it, I suggest
starting from this point.

This is a code dump.  This mail is the best documentation of the tool,
the code needs some love, there are problems and quirks.
It is far from a polished project.  I feel bad for throwing it over
the fence like that, but not publishing it is also unsatisfactory.

I'd love to see someone pick it up, improve it, polish it, love it.
I'm happy to help with that if I can, of course, mostly by
braindumping.  If there were a well-maintained version of it, I'd
switch to it in a minute.  Maybe it can be developed under the
umbrella of the afew project, even integrated into the main project.

 -- Shoutout to the afew developers!  You taking the stagnating afew off
 -- my shoulders and caring for it was the best thing that ever happened
 -- to a project of mine.  You are the best <3

Happy hacking,
Justus
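
Addendum: for anyone who wants to see the shape of the algorithm
sketched above (bag of words, then multi-label classification with a
support vector machine) without reading the repository, here is a
minimal, dependency-free illustration.  It is not the code from [0]:
blaecksprutte leaves the heavy lifting to scikit-learn, whereas this
stand-in hand-rolls a tiny linear SVM (subgradient descent on the
regularised hinge loss) and trains one independent classifier per tag.
Every function name, the toy corpus, and the hyperparameters are
assumptions made up for this example.

```python
from collections import Counter


def bag_of_words(text):
    """Count lower-cased, whitespace-separated words; word order is discarded."""
    return Counter(text.lower().split())


def score(model, features):
    """Linear score of one bag of words: bias + sum(weight * count)."""
    weights, bias = model
    return bias + sum(weights.get(word, 0.0) * n for word, n in features.items())


def train_binary_svm(samples, epochs=100, lr=0.1, lam=0.01):
    """Fit a linear SVM by subgradient descent on the L2-regularised hinge loss.

    samples is a list of (bag_of_words, y) pairs with y in {+1, -1}.
    epochs, lr and lam are illustrative values, not tuned ones.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, y in samples:
            margin = y * score((weights, bias), features)
            for word in weights:      # regularisation: shrink every weight
                weights[word] *= 1.0 - lr * lam
            if margin < 1.0:          # hinge loss is active: step towards y
                for word, n in features.items():
                    weights[word] = weights.get(word, 0.0) + lr * y * n
                bias += lr * y
    return weights, bias


def train_tagger(mails):
    """Multi-label via one-vs-rest: one independent binary SVM per tag."""
    bags = [(bag_of_words(text), tags) for text, tags in mails]
    all_tags = set().union(*(tags for _, tags in bags))
    return {
        tag: train_binary_svm([(bag, 1 if tag in tags else -1) for bag, tags in bags])
        for tag in sorted(all_tags)
    }


def predict_tags(tagger, text):
    """Return every tag whose classifier gives the mail a positive score."""
    bag = bag_of_words(text)
    return {tag for tag, model in tagger.items() if score(model, bag) > 0}


# Made-up toy corpus: (mail text, set of tags).  A mail may carry several tags.
TRAINING = [
    ("notmuch patch", {"lists"}),
    ("invoice due", {"finance"}),
    ("notmuch invoice", {"lists", "finance"}),
]

tagger = train_tagger(TRAINING)
for text, _ in TRAINING:
    print(text, "->", sorted(predict_tags(tagger, text)))
```

Because each tag gets its own yes/no classifier, a mail can end up
with zero, one, or several tags, which is what makes the setup
multi-label rather than plain multi-class.  A real corpus would need a
proper tokeniser and a held-out test set; this sketch only evaluates
on its own training mails.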