unofficial mirror of notmuch@notmuchmail.org
From: David Bremner <david@tethera.net>
To: "notmuch mailing list" <notmuch@notmuchmail.org>
Subject: On disk tag storage format
Date: Thu, 29 Nov 2012 09:01:23 -0400
Message-ID: <874nk8v9zw.fsf@zancas.localnet>

[-- Attachment #1: Type: text/plain, Size: 1626 bytes --]


Austin outlined on IRC a way of representing tags on disk as hardlinks
to messages. In order to make the discussion more concrete, I wrote a
prototype in Python to dump the notmuch database to this format. On my
250k messages, this creates 40k new hardlinks and uses about 5M of
disk space. The dump process takes about 20s on my Core i7 machine.
With symbolic links, the same database takes about 150M of disk space;
this isn't great but it isn't unbearable either.
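
To make the layout concrete, here is a minimal sketch of tagging one
message by hardlinking it into a per-tag maildir. The helper name
tag_with_hardlink, the tag "todo" and all paths are made up for the
demo; it runs in a throwaway temporary directory and does not touch
notmuch at all.

```python
import os
import tempfile

def tag_with_hardlink(message_path, tagroot, tag, name):
    """Hardlink message_path to tagroot/<tag>/cur/<name>; return the new path."""
    curdir = os.path.join(tagroot, tag, 'cur')
    if not os.path.isdir(curdir):
        os.makedirs(curdir)
    newlink = os.path.join(curdir, name)
    os.link(message_path, newlink)
    return newlink

# Throwaway demo: a fake maildir with a single fake message.
base = tempfile.mkdtemp()
maildir_cur = os.path.join(base, 'mail', 'cur')
os.makedirs(maildir_cur)
message = os.path.join(maildir_cur, 'msg1')
with open(message, 'w') as f:
    f.write('Subject: test\n\nbody\n')

link = tag_with_hardlink(message, os.path.join(base, 'tags'), 'todo', 'msg1')

# Both names point at the same inode, so the message data is not duplicated.
same_file = os.path.samefile(message, link)
link_count = os.stat(message).st_nlink
print(same_file, link_count)
```

Hardlinks only work within one filesystem, which is one reason the
script also has a symlink mode.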

The assumption in both cases is that maildir flag synchronization
(maildir.synchronize_flags) is on, so most tags are stored in the
original maildirs.
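
As a rough illustration of that assumption, the sketch below filters
out the tags that mirror maildir flags, using the same pattern as the
attached script. The function name tags_needing_links and the example
tag list are made up.

```python
import re

# Same pattern as in the attached script: tags that mirror maildir flags.
maildirish = re.compile(r"^(draft|flagged|passed|replied|unread)$")

def tags_needing_links(tags):
    """Return the tags that cannot be expressed as maildir flags."""
    return [t for t in tags if t and not maildirish.match(t)]

example = ['inbox', 'unread', 'replied', 'notmuch', 'drafts']
remaining = tags_needing_links(example)
print(remaining)
```

Only the remaining tags need a link of their own; note that the
pattern is anchored, so a tag such as "drafts" is not treated as a
maildir flag.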

In principle such a representation (or some variation) could be used
to interact with some external source of tagging information like Gmail.
It could also be used (with rsync --hard-links?) to synchronize notmuch
databases between machines.

I'm still unsure about the runtime performance impact of updating the
file system and the Xapian index with every tag operation, but I thought
I would see if the representation itself was usable for most people
without bringing the filesystem to its knees. So I'd be interested to
hear other people's experiences running this script. It _should_ be safe
since it opens the database read-only, but the smart money is on
backups before running other people's experimental code.  Especially
since I don't claim to actually know Python.

One technicality is that this hex-encodes ':' (compared to the other code
floating around); this is so hex_encode(message_id)+maildir_flags is a
valid maildir name. The uniqueness of the names comes from the (much
discussed) keying of messages on message-ids.
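
A small sketch of how the link name is put together, reusing the
character set and flag pattern from the attached script. The helper
link_name, the message-id and the maildir path below are invented for
the example.

```python
import re

CHARSET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+_@=.,-'
encode_re = '([^{0}])'.format(CHARSET)
flagre = re.compile(r"(:2,[^:]*)$")

def encode_for_fs(s):
    """Replace every character outside CHARSET with %xx (lowercase hex)."""
    return re.sub(encode_re, lambda m: '%{:02x}'.format(ord(m.group(1))), s)

def link_name(message_id, filename):
    """Encoded message-id followed by the maildir flags of filename, if any."""
    m = flagre.search(filename)
    flags = m.group(1) if m else ''
    return encode_for_fs(message_id) + flags

# Hypothetical message-id containing ':' and '/', and a made-up maildir path.
name = link_name('odd:id/1@example.org',
                 '/home/user/Maildir/cur/1354194083.M1P2.host:2,RS')
print(name)
```

Because ':' and '/' in the message-id are encoded, the only ':' left
in the name is the one introducing the maildir flags.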


[-- Attachment #2: dump-tags.py --]
[-- Type: text/x-python, Size: 1952 bytes --]

import notmuch
import re
import os, errno

maildirish= re.compile(r"^(draft|flagged|passed|replied|unread)$")

symlink = False

# some random person on stack overflow suggests:

def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

CHARSET=('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+_@=.,-')

encode_re='([^{0}])'.format(CHARSET)

def encode_one_char(match):
    return('%{:02x}'.format(ord(match.group(1))))

def encode_for_fs(s):
    # percent-encode every character outside CHARSET
    return re.sub(encode_re, encode_one_char, s, 0)

# maildir flags suffix, e.g. ":2,RS", at the end of a filename
flagre = re.compile(r"(:2,[^:]*)$")

tagroot='tags'

db = notmuch.Database(mode=notmuch.Database.MODE.READ_ONLY)
q_new = notmuch.Query(db, '*')
q_new.set_sort(notmuch.Query.SORT.UNSORTED)
for msg in q_new.search_messages():
    for tag in  msg.get_tags():
        if tag == '':
            print('Dunno what to do about empty tag on %s' % msg.get_message_id())
        else:
            if not maildirish.match(tag):
                # ignore multiple filenames
                filename = msg.get_filename()
                message_id = msg.get_message_id()
                flagsmatch = flagre.search(filename)
                if flagsmatch is None:
                    flags = ''
                else:
                    flags = flagsmatch.group(1)
                    
                # link the message whether or not its filename carries flags;
                # each tag gets its own maildir (cur/new/tmp) under tagroot
                tagdir = os.path.join(tagroot, encode_for_fs(tag))
                curdir = os.path.join(tagdir, 'cur')
                mkdir_p(os.path.join(tagdir, 'new'))
                mkdir_p(os.path.join(tagdir, 'tmp'))
                mkdir_p(curdir)

                newlink = os.path.join(curdir, encode_for_fs(message_id) + flags)
                if symlink:
                    os.symlink(filename, newlink)
                else:
                    os.link(filename, newlink)
