From: David Bremner
To: notmuch@notmuchmail.org
Subject: [PATCH 1/2] CL/git: add format version 1
Date: Thu, 23 Jun 2022 09:30:44 -0300
Message-Id: <20220623123044.1283154-2-david@tethera.net>
In-Reply-To: <20220623123044.1283154-1-david@tethera.net>
References: <20220623123044.1283154-1-david@tethera.net>

The original nmbug format (now called version 0) creates 1 subdirectory of 'tags/' per message. This causes problems for more than (roughly) 100k messages. Version 1 introduces 2 layers of hashed directories. This scheme was chosen to balance the number of subdirectories with the number of extra directories (and git objects) created via hashing. This should be upward compatible in the sense that old repositories will continue to work with the updated notmuch-git.

---
 doc/man1/notmuch-git.rst | 40 +++++++++++++++++++++++++++----
 notmuch-git.py           | 52 ++++++++++++++++++++++++++++++++++------
 test/T850-git.sh         | 48 ++++++++++++++++++-------------------
 test/test-lib.sh         |  4 ++++
 4 files changed, 109 insertions(+), 35 deletions(-)

diff --git a/doc/man1/notmuch-git.rst b/doc/man1/notmuch-git.rst index fa7a748e..389371bf 100644 --- a/doc/man1/notmuch-git.rst +++ b/doc/man1/notmuch-git.rst @@ -235,15 +235,47 @@ REPOSITORY CONTENTS =================== The tags are stored in the git repo (and exported) as a set of empty -files. For a message with Message-Id *id*, for each tag *tag*, there +files. These empty files are contained within a directory named after +the message-id. + +In what follows `encode()` represents a POSIX filesystem safe +encoding. The encoding preserves alphanumerics, and the characters +`+-_@=.,:`. All other octets are replaced with `%` followed by a two +digit hex number. + +Currently :any:`notmuch-git` can read any format version, but can only +create (via :any:`init`) :ref:`version 1 <format_version_1>` repositories. + +.. 
_format_version_0: + +Version 0 +--------- + +This is the legacy format created by the `nmbug` tool prior to release +0.37. For a message with Message-Id *id*, for each tag *tag*, there is an empty file with path tags/ `encode` (*id*) / `encode` (*tag*) -The encoding preserves alphanumerics, and the characters `+-_@=.,:`. -All other octets are replaced with `%` followed by a two digit hex -number. +.. _format_version_1: + +Version 1 +--------- + +In format version 1 and later, the format version is contained in a +top level file called FORMAT. + +For a message with Message-Id *id*, for each tag *tag*, there +is an empty file with path + + tags/ `hash1` (*id*) / `hash2` (*id*) / `encode` (*id*) / `encode` (*tag*) + +The hash functions each represent one byte of the `blake2b` hex +digest. +Compared to :ref:`version 0 <format_version_0>`, this reduces the +number of subdirectories within each directory. + .. _repo_location: REPOSITORY LOCATION diff --git a/notmuch-git.py b/notmuch-git.py index f188660c..8781e3d1 100644 --- a/notmuch-git.py +++ b/notmuch-git.py @@ -46,10 +46,12 @@ _LOG.addHandler(_logging.StreamHandler()) NOTMUCH_GIT_DIR = None TAG_PREFIX = None +FORMAT_VERSION = 0 _HEX_ESCAPE_REGEX = _re.compile('%[0-9A-F]{2}') _TAG_DIRECTORY = 'tags/' -_TAG_FILE_REGEX = _re.compile(_TAG_DIRECTORY + '(?P<id>[^/]*)/(?P<tag>[^/]*)') +_TAG_FILE_REGEX = ( _re.compile(_TAG_DIRECTORY + '(?P<id>[^/]*)/(?P<tag>[^/]*)'), + _re.compile(_TAG_DIRECTORY + '([0-9a-f]{2}/){2}(?P<id>[^/]*)/(?P<tag>[^/]*)')) # magic hash for Git (git hash-object -t blob /dev/null) _EMPTYBLOB = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391' @@ -265,7 +267,7 @@ def archive(treeish='HEAD', args=()): Each tag $tag for message with Message-Id $id is written to an empty file - tags/encode($id)/encode($tag) + tags/hash1(id)/hash2(id)/encode($id)/encode($tag) The encoding preserves alphanumerics, and the characters "+-_@=.:," (not the quotes). 
All other octets are replaced with @@ -469,9 +471,17 @@ def init(remote=None): _git(args=['config', 'core.logallrefupdates', 'true'], wait=True) # create an empty blob (e69de29bb2d1d6434b8b29ae775ad8c2e48c5391) _git(args=['hash-object', '-w', '--stdin'], input='', wait=True) + # create a blob for the FORMAT file + (status, stdout, _) = _git(args=['hash-object', '-w', '--stdin'], stdout=_subprocess.PIPE, + input='1\n', wait=True) + verhash=stdout.rstrip() + _LOG.debug('hash of FORMAT blob = {:s}'.format(verhash)) + # Add FORMAT to the index + _git(args=['update-index', '--add', '--cacheinfo', '100644,{:s},FORMAT'.format(verhash)], wait=True) + _git( args=[ - 'commit', '--allow-empty', '-m', 'Start a new nmbug repository' + 'commit', '-m', 'Start a new notmuch-git repository' ], additional_env={'GIT_WORK_TREE': NOTMUCH_GIT_DIR}, wait=True) @@ -821,7 +831,7 @@ def _clear_tags_for_message(index, id): Neither 'id' nor the tags in 'tags' should be encoded/escaped. """ - dir = 'tags/{id}'.format(id=_hex_quote(string=id)) + dir = _id_path(id) with _git( args=['ls-files', dir], @@ -838,6 +848,21 @@ def _read_database_lastmod(): (count,uuid,lastmod_str) = notmuch.stdout.readline().split() return (count,uuid,int(lastmod_str)) +def _id_path(id): + hid=_hex_quote(string=id) + from hashlib import blake2b + + if FORMAT_VERSION==0: + return 'tags/{hid}'.format(hid=hid) + elif FORMAT_VERSION==1: + idhash = blake2b(hid.encode('utf8'), digest_size=2).hexdigest() + return 'tags/{dir1}/{dir2}/{hid}'.format( + hid=hid, + dir1=idhash[0:2],dir2=idhash[2:]) + else: + _LOG.error("Unknown format version",FORMAT_VERSION) + _sys.exit(1) + def _index_tags_for_message(id, status, tags): """ Update the Git index to either create or delete an empty file. 
@@ -852,8 +877,7 @@ def _index_tags_for_message(id, status, tags): hash = '0000000000000000000000000000000000000000' for tag in tags: - path = 'tags/{id}/{tag}'.format( - id=_hex_quote(string=id), tag=_hex_quote(string=tag)) + path = '{ipath}/{tag}'.format(ipath=_id_path(id),tag=_hex_quote(string=tag)) yield '{mode} {hash}\t{path}\n'.format(mode=mode, hash=hash, path=path) @@ -869,7 +893,7 @@ def _diff_refs(filter, a='HEAD', b='@{upstream}'): def _unpack_diff_lines(stream): "Iterate through (id, tag) tuples in a diff stream." for line in stream: - match = _TAG_FILE_REGEX.match(line.strip()) + match = _TAG_FILE_REGEX[FORMAT_VERSION].match(line.strip()) if not match: message = 'non-tag line in diff: {!r}'.format(line.strip()) if line.startswith(_TAG_DIRECTORY): @@ -907,6 +931,17 @@ def _notmuch_config_get(key): _sys.exit(1) return stdout.rstrip() +def read_format_version(): + try: + (status, stdout, stderr) = _git( + args=['cat-file', 'blob', 'master:FORMAT'], + stdout=_subprocess.PIPE, stderr=_subprocess.PIPE, wait=True) + except SubprocessError as e: + _LOG.debug("failed to read FORMAT file from git, assuming format version 0") + return 0 + + return int(stdout) + # based on BaseDirectory.save_data_path from pyxdg (LGPL2+) def xdg_data_path(profile): resource = _os.path.join('notmuch',profile,'git') @@ -1104,6 +1139,9 @@ if __name__ == '__main__': _LOG.debug('prefix = {:s}'.format(TAG_PREFIX)) _LOG.debug('repository = {:s}'.format(NOTMUCH_GIT_DIR)) + FORMAT_VERSION = read_format_version() + _LOG.debug('FORMAT_VERSION={:d}'.format(FORMAT_VERSION)) + if args.func == help: arg_names = ['command'] else: diff --git a/test/T850-git.sh b/test/T850-git.sh index 7ea50939..342cc31b 100755 --- a/test/T850-git.sh +++ b/test/T850-git.sh @@ -40,10 +40,10 @@ notmuch tag -new-prefix::foo id:20091117190054.GU3165@dottiness.seas.harvard.edu test_begin_subtest "committing new prefix works with force" notmuch tag +new-prefix::foo id:20091117190054.GU3165@dottiness.seas.harvard.edu 
notmuch git -l debug -p 'new-prefix::' -C force-prefix.git commit --force -git -C force-prefix.git ls-tree -r --name-only HEAD | xargs dirname | sort -u | sed s,tags/,id:, > OUTPUT +git -C force-prefix.git ls-tree -r --name-only HEAD | notmuch_git_sanitize | xargs dirname | sort -u > OUTPUT notmuch tag -new-prefix::foo id:20091117190054.GU3165@dottiness.seas.harvard.edu cat <<EOF >EXPECTED -id:20091117190054.GU3165@dottiness.seas.harvard.edu +20091117190054.GU3165@dottiness.seas.harvard.edu EOF test_expect_equal_file_nonempty EXPECTED OUTPUT @@ -62,8 +62,8 @@ test_expect_equal_file_nonempty EXPECTED OUTPUT test_begin_subtest "commit" notmuch git -C tags.git commit --force -git -C tags.git ls-tree -r --name-only HEAD | xargs dirname | sort -u | sed s,tags/,id:, > OUTPUT -notmuch search --output=messages '*' | sort > EXPECTED +git -C tags.git ls-tree -r --name-only HEAD | notmuch_git_sanitize | xargs dirname | sort -u > OUTPUT +notmuch search --output=messages '*' | sed s/^id:// | sort > EXPECTED test_expect_equal_file_nonempty EXPECTED OUTPUT test_begin_subtest "commit --force succeeds" @@ -88,22 +88,22 @@ test_expect_equal_file_nonempty BEFORE AFTER test_begin_subtest "commit (incremental)" notmuch tag +test id:20091117190054.GU3165@dottiness.seas.harvard.edu notmuch git -C tags.git commit -git -C tags.git ls-tree -r --name-only HEAD | +git -C tags.git ls-tree -r --name-only HEAD | notmuch_git_sanitize | \ grep 20091117190054 | sort > OUTPUT echo "--------------------------------------------------" >> OUTPUT notmuch tag -test id:20091117190054.GU3165@dottiness.seas.harvard.edu notmuch git -C tags.git commit -git -C tags.git ls-tree -r --name-only HEAD | +git -C tags.git ls-tree -r --name-only HEAD | notmuch_git_sanitize | \ grep 20091117190054 | sort >> OUTPUT cat <<EOF > EXPECTED -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/inbox -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/signed -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/test 
-tags/20091117190054.GU3165@dottiness.seas.harvard.edu/unread +20091117190054.GU3165@dottiness.seas.harvard.edu/inbox +20091117190054.GU3165@dottiness.seas.harvard.edu/signed +20091117190054.GU3165@dottiness.seas.harvard.edu/test +20091117190054.GU3165@dottiness.seas.harvard.edu/unread -------------------------------------------------- -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/inbox -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/signed -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/unread +20091117190054.GU3165@dottiness.seas.harvard.edu/inbox +20091117190054.GU3165@dottiness.seas.harvard.edu/signed +20091117190054.GU3165@dottiness.seas.harvard.edu/unread EOF test_expect_equal_file_nonempty EXPECTED OUTPUT @@ -111,18 +111,18 @@ test_begin_subtest "commit (change prefix)" notmuch tag +test::one id:20091117190054.GU3165@dottiness.seas.harvard.edu notmuch git -C tags.git -p 'test::' commit --force git -C tags.git ls-tree -r --name-only HEAD | - grep 20091117190054 | sort > OUTPUT + grep 20091117190054 | notmuch_git_sanitize | sort > OUTPUT echo "--------------------------------------------------" >> OUTPUT notmuch tag -test::one id:20091117190054.GU3165@dottiness.seas.harvard.edu notmuch git -C tags.git commit --force -git -C tags.git ls-tree -r --name-only HEAD | +git -C tags.git ls-tree -r --name-only HEAD | notmuch_git_sanitize | \ grep 20091117190054 | sort >> OUTPUT cat <<EOF > EXPECTED -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/one +20091117190054.GU3165@dottiness.seas.harvard.edu/one -------------------------------------------------- -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/inbox -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/signed -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/unread +20091117190054.GU3165@dottiness.seas.harvard.edu/inbox +20091117190054.GU3165@dottiness.seas.harvard.edu/signed +20091117190054.GU3165@dottiness.seas.harvard.edu/unread EOF test_expect_equal_file_nonempty EXPECTED 
OUTPUT @@ -151,12 +151,12 @@ test_expect_equal_file_nonempty BEFORE AFTER test_begin_subtest "archive" notmuch git -C tags.git archive | tar tf - | \ - grep 20091117190054.GU3165@dottiness.seas.harvard.edu | sort > OUTPUT + grep 20091117190054.GU3165@dottiness.seas.harvard.edu | notmuch_git_sanitize | sort > OUTPUT cat <<EOF > EXPECTED -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/ -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/inbox -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/signed -tags/20091117190054.GU3165@dottiness.seas.harvard.edu/unread +20091117190054.GU3165@dottiness.seas.harvard.edu/ +20091117190054.GU3165@dottiness.seas.harvard.edu/inbox +20091117190054.GU3165@dottiness.seas.harvard.edu/signed +20091117190054.GU3165@dottiness.seas.harvard.edu/unread EOF notmuch git -C tags.git checkout test_expect_equal_file EXPECTED OUTPUT diff --git a/test/test-lib.sh b/test/test-lib.sh index 59b6079d..dfbee365 100644 --- a/test/test-lib.sh +++ b/test/test-lib.sh @@ -545,6 +545,10 @@ notmuch_date_sanitize () { -e 's/^Date: Fri, 05 Jan 2001 .*0000/Date: GENERATED_DATE/' } +# remove redundant parts of notmuch-git internal paths +notmuch_git_sanitize () { + sed -e 's,tags/\([0-9a-f]\{2\}/\)\{2\},,' -e '/FORMAT/d' +} notmuch_uuid_sanitize () { sed 's/[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}/UUID/g' } -- 2.35.2
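
For readers who want to try the version 1 layout outside the patch, here is a minimal standalone sketch of the path computation done by `_id_path`. It is an illustration, not the code from notmuch-git.py. The names `hex_quote` and `id_path` are hypothetical, and `hex_quote` is only a stand-in written from the documented rule (keep alphanumerics and `+-_@=.,:`, escape every other octet as `%` plus two hex digits). Emitting the hex digits in lowercase is an assumption.

```python
from hashlib import blake2b
import string

# Characters the man page says encode() preserves.
_SAFE = set(string.ascii_letters + string.digits + '+-_@=.,:')


def hex_quote(text):
    """Hypothetical stand-in for notmuch-git's _hex_quote helper.

    Escapes every UTF-8 octet outside _SAFE as '%' plus two hex digits.
    Lowercase hex is an assumption made for this sketch.
    """
    out = []
    for octet in text.encode('utf-8'):
        ch = chr(octet)
        out.append(ch if ch in _SAFE else '%{:02x}'.format(octet))
    return ''.join(out)


def id_path(message_id, format_version=1):
    """Return the directory that holds the tag files for one message."""
    hid = hex_quote(message_id)
    if format_version == 0:
        # Version 0: one subdirectory of tags/ per message.
        return 'tags/{}'.format(hid)
    if format_version == 1:
        # Version 1: a 2-byte blake2b digest of the encoded id gives
        # four hex characters, split into two directory levels.
        idhash = blake2b(hid.encode('utf8'), digest_size=2).hexdigest()
        return 'tags/{}/{}/{}'.format(idhash[0:2], idhash[2:], hid)
    raise ValueError('unknown format version: {!r}'.format(format_version))


if __name__ == '__main__':
    print(id_path('20091117190054.GU3165@dottiness.seas.harvard.edu'))
```

With one byte of digest per level, each level has at most 256 subdirectories, so roughly 100k messages spread to only a few message directories per leaf on average.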