From: David Bremner
To: notmuch@notmuchmail.org
Subject: [PATCH v3 12/17] CLI/git: cache git indices
Date: Sat, 4 Jun 2022 14:23:08 -0300
Message-Id: <20220604172313.1149879-13-david@tethera.net>
X-Mailer: git-send-email 2.35.2
In-Reply-To: <20220604172313.1149879-1-david@tethera.net>
References: <20220604172313.1149879-1-david@tethera.net>
List-Id: "Use and development of the notmuch mail system."

If the private index file matches a previously known revision of the
database, we can update the index incrementally using the recorded
lastmod counter. This is typically much faster than a full update,
although it could be slower in the case of large changes to the
database.

The "git-read-tree HEAD" is also a bottleneck, but unfortunately
sometimes is needed. Cache the index checksum and hash to reduce the
number of times the operation is run. The overall design is a
simplified version of the PrivateIndex class.
---
 notmuch-git.py   | 318 ++++++++++++++++++++++++++++++++++-------------
 test/T850-git.sh |  41 ++++++
 2 files changed, 274 insertions(+), 85 deletions(-)

diff --git a/notmuch-git.py b/notmuch-git.py
index 6ea50fe8..e419cdf7 100644
--- a/notmuch-git.py
+++ b/notmuch-git.py
@@ -38,6 +38,7 @@ import tempfile as _tempfile
 import textwrap as _textwrap
 from urllib.parse import quote as _quote
 from urllib.parse import unquote as _unquote
+import json as _json
 
 _LOG = _logging.getLogger('nmbug')
 _LOG.setLevel(_logging.WARNING)
@@ -299,41 +300,98 @@ def _is_committed(status):
     return len(status['added']) + len(status['deleted']) == 0
 
 
+class CachedIndex:
+    def __init__(self, repo, treeish):
+        self.cache_path = _os.path.join(repo, 'notmuch', 'index_cache.json')
+        self.index_path = _os.path.join(repo, 'index')
+        self.current_treeish = treeish
+        # cached values
+        self.treeish = None
+        self.hash = None
+        self.index_checksum = None
+
+        self._load_cache_file()
+
+    def _load_cache_file(self):
+        try:
+            with open(self.cache_path) as f:
+                data = _json.load(f)
+                self.treeish = data['treeish']
+                self.hash = data['hash']
+                self.index_checksum = data['index_checksum']
+        except FileNotFoundError:
+            pass
+        except _json.JSONDecodeError:
+            _LOG.error("Error decoding cache")
+            _sys.exit(1)
+
+    def __enter__(self):
+        self.read_tree()
+        return self
+
+    def __exit__(self, type, value, traceback):
+        checksum = _read_index_checksum(self.index_path)
+        (_, hash, _) = _git(
+            args=['rev-parse', self.current_treeish],
+            stdout=_subprocess.PIPE,
+            wait=True)
+
+        with open(self.cache_path, "w") as f:
+            _json.dump({'treeish': self.current_treeish,
+                        'hash': hash.rstrip(), 'index_checksum': checksum }, f)
+
+    @timed
+    def read_tree(self):
+        current_checksum = _read_index_checksum(self.index_path)
+        (_, hash, _) = _git(
+            args=['rev-parse', self.current_treeish],
+            stdout=_subprocess.PIPE,
+            wait=True)
+        current_hash = hash.rstrip()
+
+        if self.current_treeish == self.treeish and \
+           self.index_checksum and self.index_checksum == current_checksum and \
+           self.hash and self.hash == current_hash:
+            return
+
+        _git(args=['read-tree', self.current_treeish], wait=True)
+
+
 def commit(treeish='HEAD', message=None):
     """
     Commit prefix-matching tags from the notmuch database to Git.
     """
+
     status = get_status()
 
     if _is_committed(status=status):
         _LOG.warning('Nothing to commit')
         return
 
-    _git(args=['read-tree', '--empty'], wait=True)
-    _git(args=['read-tree', treeish], wait=True)
-    try:
-        _update_index(status=status)
-        (_, tree, _) = _git(
-            args=['write-tree'],
-            stdout=_subprocess.PIPE,
-            wait=True)
-        (_, parent, _) = _git(
-            args=['rev-parse', treeish],
-            stdout=_subprocess.PIPE,
-            wait=True)
-        (_, commit, _) = _git(
-            args=['commit-tree', tree.strip(), '-p', parent.strip()],
-            input=message,
-            stdout=_subprocess.PIPE,
-            wait=True)
-        _git(
-            args=['update-ref', treeish, commit.strip()],
-            stdout=_subprocess.PIPE,
-            wait=True)
-    except Exception as e:
-        _git(args=['read-tree', '--empty'], wait=True)
-        _git(args=['read-tree', treeish], wait=True)
-        raise
+    with CachedIndex(NOTMUCH_GIT_DIR, treeish) as index:
+        try:
+            _update_index(status=status)
+            (_, tree, _) = _git(
+                args=['write-tree'],
+                stdout=_subprocess.PIPE,
+                wait=True)
+            (_, parent, _) = _git(
+                args=['rev-parse', treeish],
+                stdout=_subprocess.PIPE,
+                wait=True)
+            (_, commit, _) = _git(
+                args=['commit-tree', tree.strip(), '-p', parent.strip()],
+                input=message,
+                stdout=_subprocess.PIPE,
+                wait=True)
+            _git(
+                args=['update-ref', treeish, commit.strip()],
+                stdout=_subprocess.PIPE,
+                wait=True)
+        except Exception as e:
+            _git(args=['read-tree', '--empty'], wait=True)
+            _git(args=['read-tree', treeish], wait=True)
+            raise
 
 @timed
 def _update_index(status):
@@ -582,50 +640,160 @@ def get_status():
         'deleted': {},
         'missing': {},
         }
-    index = _index_tags()
-    maybe_deleted = _diff_index(index=index, filter='D')
-    for id, tags in maybe_deleted.items():
-        (_, stdout, stderr) = _spawn(
-            args=['notmuch', 'search', '--output=files', 'id:{0}'.format(id)],
-            stdout=_subprocess.PIPE,
-            wait=True)
-        if stdout:
-            status['deleted'][id] = tags
-        else:
-            status['missing'][id] = tags
-    status['added'] = _diff_index(index=index, filter='A')
-    _os.remove(index)
+    with PrivateIndex(repo=NOTMUCH_GIT_DIR, prefix=TAG_PREFIX) as index:
+        maybe_deleted = index.diff(filter='D')
+        for id, tags in maybe_deleted.items():
+            (_, stdout, stderr) = _spawn(
+                args=['notmuch', 'search', '--output=files', 'id:{0}'.format(id)],
+                stdout=_subprocess.PIPE,
+                wait=True)
+            if stdout:
+                status['deleted'][id] = tags
+            else:
+                status['missing'][id] = tags
+        status['added'] = index.diff(filter='A')
+
     return status
 
 
-@timed
-def _index_tags():
-    "Write notmuch tags to the nmbug.index."
-    path = _os.path.join(NOTMUCH_GIT_DIR, 'nmbug.index')
-    prefix = '+{0}'.format(_ENCODED_TAG_PREFIX)
-    _git(
-        args=['read-tree', '--empty'],
-        additional_env={'GIT_INDEX_FILE': path}, wait=True)
-    with _spawn(
-            args=['notmuch', 'dump', '--format=batch-tag', '--query=sexp', '--', _tag_query()],
-            stdout=_subprocess.PIPE) as notmuch:
+class PrivateIndex:
+    def __init__(self, repo, prefix):
+        try:
+            _os.makedirs(_os.path.join(repo, 'notmuch'))
+        except FileExistsError:
+            pass
+
+        file_name = 'notmuch/index'
+        self.index_path = _os.path.join(repo, file_name)
+        self.cache_path = _os.path.join(repo, 'notmuch', '{:s}.json'.format(_hex_quote(file_name)))
+
+        self.current_prefix = prefix
+
+        self.prefix = None
+        self.uuid = None
+        self.lastmod = None
+        self.checksum = None
+        self._load_cache_file()
+        self._index_tags()
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, type, value, traceback):
+        checksum = _read_index_checksum(self.index_path)
+        (count, uuid, lastmod) = _read_database_lastmod()
+        with open(self.cache_path, "w") as f:
+            _json.dump({'prefix': self.current_prefix, 'uuid': uuid, 'lastmod': lastmod, 'checksum': checksum }, f)
+
+    def _load_cache_file(self):
+        try:
+            with open(self.cache_path) as f:
+                data = _json.load(f)
+                self.prefix = data['prefix']
+                self.uuid = data['uuid']
+                self.lastmod = data['lastmod']
+                self.checksum = data['checksum']
+        except FileNotFoundError:
+            return None
+        except _json.JSONDecodeError:
+            _LOG.error("Error decoding cache")
+            _sys.exit(1)
+
+    @timed
+    def _index_tags(self):
+        "Write notmuch tags to private git index."
+        prefix = '+{0}'.format(_ENCODED_TAG_PREFIX)
+        current_checksum = _read_index_checksum(self.index_path)
+        if (self.prefix == None or self.prefix != self.current_prefix
+            or self.checksum == None or self.checksum != current_checksum):
+            _git(
+                args=['read-tree', '--empty'],
+                additional_env={'GIT_INDEX_FILE': self.index_path}, wait=True)
+
+        query = _tag_query()
+        clear_tags = False
+        (count,uuid,lastmod) = _read_database_lastmod()
+        if self.prefix == self.current_prefix and self.uuid \
+           and self.uuid == uuid and self.checksum == current_checksum:
+            query = '(and (infix "lastmod:{:d}..")) {:s})'.format(self.lastmod+1, query)
+            clear_tags = True
+        with _spawn(
+                args=['notmuch', 'dump', '--format=batch-tag', '--query=sexp', '--', query],
+                stdout=_subprocess.PIPE) as notmuch:
+            with _git(
+                    args=['update-index', '--index-info'],
+                    stdin=_subprocess.PIPE,
+                    additional_env={'GIT_INDEX_FILE': self.index_path}) as git:
+                for line in notmuch.stdout:
+                    if line.strip().startswith('#'):
+                        continue
+                    (tags_string, id) = [_.strip() for _ in line.split(' -- id:')]
+                    tags = [
+                        _unquote(tag[len(prefix):])
+                        for tag in tags_string.split()
+                        if tag.startswith(prefix)]
+                    id = _xapian_unquote(string=id)
+                    if clear_tags:
+                        for line in _clear_tags_for_message(index=self.index_path, id=id):
+                            git.stdin.write(line)
+                    for line in _index_tags_for_message(
+                            id=id, status='A', tags=tags):
+                        git.stdin.write(line)
+
+    @timed
+    def diff(self, filter):
+        """
+        Get an {id: {tag, ...}} dict for a given filter.
+
+        For example, use 'A' to find added tags, and 'D' to find deleted tags.
+        """
+        s = _collections.defaultdict(set)
         with _git(
-                args=['update-index', '--index-info'],
-                stdin=_subprocess.PIPE,
-                additional_env={'GIT_INDEX_FILE': path}) as git:
-            for line in notmuch.stdout:
-                if line.strip().startswith('#'):
-                    continue
-                (tags_string, id) = [_.strip() for _ in line.split(' -- id:')]
-                tags = [
-                    _unquote(tag[len(prefix):])
-                    for tag in tags_string.split()
-                    if tag.startswith(prefix)]
-                id = _xapian_unquote(string=id)
-                for line in _index_tags_for_message(
-                        id=id, status='A', tags=tags):
-                    git.stdin.write(line)
-    return path
+                args=[
+                    'diff-index', '--cached', '--diff-filter', filter,
+                    '--name-only', 'HEAD'],
+                additional_env={'GIT_INDEX_FILE': self.index_path},
+                stdout=_subprocess.PIPE) as p:
+            # Once we drop Python < 3.3, we can use 'yield from' here
+            for id, tag in _unpack_diff_lines(stream=p.stdout):
+                s[id].add(tag)
+        return s
+
+def _read_index_checksum (index_path):
+    """Read the index checksum, as defined by index-format.txt in the git source
+    WARNING: assumes SHA1 repo"""
+    import binascii
+    try:
+        with open(index_path, 'rb') as f:
+            size=_os.path.getsize(index_path)
+            f.seek(size-20);
+            return binascii.hexlify(f.read(20)).decode('ascii')
+    except FileNotFoundError:
+        return None
+
+
+def _clear_tags_for_message(index, id):
+    """
+    Clear any existing index entries for message 'id'
+
+    Neither 'id' nor the tags in 'tags' should be encoded/escaped.
+    """
+    dir = 'tags/{id}'.format(id=_hex_quote(string=id))
+
+    with _git(
+            args=['ls-files', dir],
+            additional_env={'GIT_INDEX_FILE': index},
+            stdout=_subprocess.PIPE) as git:
+        for file in git.stdout:
+            line = '0 0000000000000000000000000000000000000000\t{:s}\n'.format(file.strip())
+            yield line
+
+def _read_database_lastmod():
+    with _spawn(
+            args=['notmuch', 'count', '--lastmod', '*'],
+            stdout=_subprocess.PIPE) as notmuch:
+        (count,uuid,lastmod_str) = notmuch.stdout.readline().split()
+        return (count,uuid,int(lastmod_str))
 
 def _index_tags_for_message(id, status, tags):
     """
@@ -646,26 +814,6 @@ def _index_tags_for_message(id, status, tags):
         yield '{mode} {hash}\t{path}\n'.format(mode=mode, hash=hash, path=path)
 
 
-@timed
-def _diff_index(index, filter):
-    """
-    Get an {id: {tag, ...}} dict for a given filter.
-
-    For example, use 'A' to find added tags, and 'D' to find deleted tags.
-    """
-    s = _collections.defaultdict(set)
-    with _git(
-            args=[
-                'diff-index', '--cached', '--diff-filter', filter,
-                '--name-only', 'HEAD'],
-            additional_env={'GIT_INDEX_FILE': index},
-            stdout=_subprocess.PIPE) as p:
-        # Once we drop Python < 3.3, we can use 'yield from' here
-        for id, tag in _unpack_diff_lines(stream=p.stdout):
-            s[id].add(tag)
-    return s
-
-
 def _diff_refs(filter, a='HEAD', b='@{upstream}'):
     with _git(
             args=['diff', '--diff-filter', filter, '--name-only', a, b],
diff --git a/test/T850-git.sh b/test/T850-git.sh
index 72091b56..db76dae9 100755
--- a/test/T850-git.sh
+++ b/test/T850-git.sh
@@ -33,6 +33,47 @@ notmuch tag '-"quoted tag"' '*'
 git -C clone2.git ls-tree -r --name-only HEAD | grep /inbox > AFTER
 test_expect_equal_file_nonempty BEFORE AFTER
 
+test_begin_subtest "commit (incremental)"
+notmuch tag +test id:20091117190054.GU3165@dottiness.seas.harvard.edu
+notmuch git -C tags.git -p '' commit
+git -C tags.git ls-tree -r --name-only HEAD |
+    grep 20091117190054 | sort > OUTPUT
+echo "--------------------------------------------------" >> OUTPUT
+notmuch tag -test id:20091117190054.GU3165@dottiness.seas.harvard.edu
+notmuch git -C tags.git -p '' commit
+git -C tags.git ls-tree -r --name-only HEAD |
+    grep 20091117190054 | sort >> OUTPUT
+cat <<EOF > EXPECTED
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/inbox
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/signed
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/test
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/unread
+--------------------------------------------------
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/inbox
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/signed
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/unread
+EOF
+test_expect_equal_file_nonempty EXPECTED OUTPUT
+
+test_begin_subtest "commit (change prefix)"
+notmuch tag +test::one id:20091117190054.GU3165@dottiness.seas.harvard.edu
+notmuch git -C tags.git -p 'test::' commit
+git -C tags.git ls-tree -r --name-only HEAD |
+    grep 20091117190054 | sort > OUTPUT
+echo "--------------------------------------------------" >> OUTPUT
+notmuch tag -test::one id:20091117190054.GU3165@dottiness.seas.harvard.edu
+notmuch git -C tags.git -p '' commit
+git -C tags.git ls-tree -r --name-only HEAD |
+    grep 20091117190054 | sort >> OUTPUT
+cat <<EOF > EXPECTED
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/one
+--------------------------------------------------
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/inbox
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/signed
+tags/20091117190054.GU3165@dottiness.seas.harvard.edu/unread
+EOF
+test_expect_equal_file_nonempty EXPECTED OUTPUT
+
 test_begin_subtest "checkout"
 notmuch dump > BEFORE
 notmuch tag -inbox '*'
-- 
2.35.2
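
The cache check described in the commit message can be sketched on its own, outside notmuch-git.py. This is only an illustration under stated assumptions: the index uses SHA-1 (a 20-byte trailer, as the patch's own warning notes), and the names read_index_checksum and cache_is_valid are hypothetical stand-ins, not functions from the patch. The idea is that "git read-tree" can be skipped only when the treeish, the resolved commit hash, and the index checksum all match what was recorded last time.

```python
import binascii
import os


def read_index_checksum(index_path):
    """Return the trailing 20 bytes of a git index file as hex, or None.

    Assumes a SHA-1 repository, where the index ends with a 20-byte
    checksum of everything before it.
    """
    try:
        size = os.path.getsize(index_path)
        if size < 20:
            return None
        with open(index_path, 'rb') as f:
            f.seek(size - 20)
            return binascii.hexlify(f.read(20)).decode('ascii')
    except FileNotFoundError:
        return None


def cache_is_valid(cached, treeish, commit_hash, checksum):
    """True when the cached record still describes the current index.

    'cached' is the dict loaded from the JSON cache file (or None).
    Missing or empty cached values never count as a match.
    """
    return bool(
        cached
        and cached.get('treeish') == treeish
        and cached.get('hash') and cached.get('hash') == commit_hash
        and cached.get('index_checksum')
        and cached.get('index_checksum') == checksum)
```

A caller would run read-tree only when cache_is_valid(...) returns False, then rewrite the cache record after the index has been updated.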