From: Jorge P. de Morais Neto <jorge+list@disroot.org>
To: notmuch@notmuchmail.org
Subject: Please discard---Corrupted database (notmuch 0.31, Emacs 27.1.50, afew 3.0.1, OfflineIMAP, Python)
Date: Wed, 18 Nov 2020 10:45:32 -0300
Message-ID: <875z62kccj.fsf@disroot.org>
In-Reply-To: <87lfezpz9t.fsf@disroot.org>
Moderator, please discard the email cited below and mentioned in the
In-Reply-To header: Message-ID 87lfezpz9t.fsf@disroot.org. Today I
submitted another copy (with the large attachment replaced by a URL to
a file upload service) and it already went through.
Regards
On [2020-11-17 Tue 16:19:10-0300], Jorge P. de Morais Neto wrote:
> Hi. I use Notmuch 0.31.2 on Emacs 27.1.50 (manually compiled on
> 2020-11-02) with matching version-pinned MELPA Stable Notmuch package on
> updated Debian buster. I have enabled buster-proposed-updates,
> buster-updates and buster-backports. I manually backport notmuch
> according to <https://wiki.debian.org/SimpleBackportCreation>. I use
> OfflineIMAP 7.3.3 (Python 2 pip), afew 3.0.1 (pip3), Bogofilter 1.2.4
> (buster) and a custom Python 3 script based on the ~notmuch~ module.
>
> Yesterday (when still on Notmuch 0.31) I noticed that, when I tagged a
> message or thread, the fido-mode completion offered many weird candidate
> tags that shouldn't exist in the database. Also, on the Notmuch Hello
> screen the ~All tags~ section would error out. I then dumped the
> database (~notmuch dump~) and noticed many lines associating weird tags
> to weird message ids.
>
> In almost every case, both the weird tags and the weird Message-Id
> contained uncommon characters, often ASCII control characters. One of
> the weird lines was " -- id:8"---specifying a message with Message-ID
> "8" and no tags. I tried ~notmuch show id:8~ and got an internal
> error---something like "message with document ID <SOME_NUMBER> has no
> thread ID".
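
A dump can be screened for lines like the ones described above with a
short script. This is only a sketch: it assumes the default batch-tag
dump format ("+tag +tag -- id:message-id", with unusual characters
written as %XX), and the function name `suspicious_lines' and the
"message ID without @" heuristic are illustrative, not part of Notmuch.

```python
import re
from urllib.parse import unquote

# ASCII control characters (including DEL)
CONTROL = re.compile(r'[\x00-\x1f\x7f]')


def suspicious_lines(dump_lines):
    """Yield (line_number, line, reason) for dump lines that look corrupted."""
    for num, raw in enumerate(dump_lines, start=1):
        line = raw.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # skip blank lines and header/config comments
        tags_part, sep, id_part = line.partition(' -- ')
        if not sep:
            yield num, line, 'no " -- " separator'
            continue
        # Assumption: %XX sequences decode like URL percent-encoding.
        decoded = unquote(tags_part) + unquote(id_part)
        if CONTROL.search(decoded):
            yield num, line, 'control character'
        elif '@' not in id_part:
            # Heuristic only: some legitimate message IDs lack an "@".
            yield num, line, 'message ID without "@"'


sample = [
    '#notmuch-dump batch-tag:3 config,properties,tags\n',
    '+inbox +unread -- id:ok@example.org\n',
    ' -- id:8\n',
    '+we%01ird -- id:x@example.org\n',
]
found = list(suspicious_lines(sample))
for num, line, reason in found:
    print(f'{num}: {reason}: {line!r}')
```

On a real dump one would pass the open file object instead of `sample'.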
>
> I then upgraded Notmuch to 0.31.2 and compacted the database but the
> error persisted. I then manually cleaned up the database dump, deleted
> the ~/offlineimap/Jorge-Disroot/.notmuch/xapian/ directory, invoked
> ~notmuch new~, and ~notmuch restore~. I checked my backups from
> 2020-11-09 (no corruption) and 2020-11-16. That latest backup was from
> before I /noticed/ the corruption, but it was affected too. I then
> diffed backup 2020-11-09 with backup 2020-11-16; and then backup
> 2020-11-16 with the current dump. The diffs suggest that the error
> involved only the addition of invalid information; I suspect and hope
> that valid information was not lost.
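
The "only additions" reading of the diffs can also be checked
mechanically. The sketch below assumes the same batch-tag dump format;
`parse_dump' and `lost_information' are hypothetical helper names. Note
that tags also disappear for legitimate reasons between two backups
(e.g. `unread' after reading a message), so a non-empty result is a
list of things to inspect, not proof of loss.

```python
def parse_dump(lines):
    """Map each message ID term to its set of (still encoded) tags."""
    tags_by_id = {}
    for raw in lines:
        line = raw.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # skip blank lines and header/config comments
        tags_part, sep, id_part = line.partition(' -- ')
        if sep:
            tags_by_id[id_part] = set(tags_part.split())
    return tags_by_id


def lost_information(old_lines, new_lines):
    """Return (missing_ids, missing_tags): data in old but not in new."""
    old, new = parse_dump(old_lines), parse_dump(new_lines)
    missing_ids = sorted(set(old) - set(new))
    missing_tags = {msg_id: old[msg_id] - new[msg_id]
                    for msg_id in old.keys() & new.keys()
                    if old[msg_id] - new[msg_id]}
    return missing_ids, missing_tags


good = ['+inbox -- id:a@x\n', '+inbox +unread -- id:b@x\n']
bad = ['+inbox -- id:a@x\n', '+inbox +unread +%01 -- id:b@x\n', ' -- id:8\n']
print(lost_information(good, bad))
```

An empty result for (older dump, newer dump) is consistent with the
newer dump containing only additions.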
>
> I attached my post-new Bash script and the Python 3 script it invokes.
> So you can see the weird lines I mentioned, I also attached the
> xz-compressed output of the command:
>
> diff -u notmuch_dump--manually_fixed notmuch_dump--corrupted > diff_notmuch_dump__manually_fixed--corrupted
>
> I have also saved the binary corrupted database. If you want to see it,
> then tell me and I may upload it to Disroot's Lufi instance. It should
> probably be shown to as few people as possible for the sake of my
> privacy.
>
> Finally, my notmuch config includes the following directives (the other
> directives are probably irrelevant to you):
>
> [new]
> tags=new
> ignore=
>
> [search]
> exclude_tags=deleted;spam;trash
>
> [maildir]
> synchronize_flags=true
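
For anyone who wants to inspect these directives from Python without
calling `notmuch config get', the fragment can be read with the
standard library. This is a sketch under the assumption that
`configparser' copes with the Notmuch config file for simple cases like
this one; `split_list' is a hypothetical helper for the
semicolon-separated lists.

```python
from configparser import ConfigParser

CONFIG_TEXT = """\
[new]
tags=new
ignore=

[search]
exclude_tags=deleted;spam;trash

[maildir]
synchronize_flags=true
"""


def split_list(value):
    """Split a semicolon-separated config value, dropping empty items."""
    return [item for item in value.split(';') if item]


parser = ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

new_tags = split_list(parser['new']['tags'])
ignored = split_list(parser['new']['ignore'])
excluded = split_list(parser['search']['exclude_tags'])
sync_flags = parser['maildir'].getboolean('synchronize_flags')
print(new_tags, ignored, excluded, sync_flags)
```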
>
> Regards
>
> #!/usr/bin/env python3
> """Mail filter (including anti-spam) and notifier for Notmuch.
>
> Track messages classified as spam (or ham) by Bogofilter via '.bf_spam'
> (resp. '.bf_ham') tags. Since afew removes the `new' tag, when
> notifying mail we track new messages with a temporary tag (option
> '--tmp' of `filter' subcommand) which we assume not to preexist in the
> database. These tags and that added by the user to spam messages can be
> customized via command-line options or, from Python, by modifying
> module-level constants or via function arguments.
>
> This script is potentially affected by environment variables, files and
> directories that affect afew, Bogofilter, Notmuch or (obviously)
> Python3, including:
> 1. `NOTMUCH_CONFIG' – location of Notmuch configuration file – and that
> file itself.
> 2. `BOGOFILTER_DIR' – location of Bogofilter's database directory – and
> that directory itself.
> 3. afew configuration.
>
> WISH: Accept customizable "new" flags (currently we assume "new").
> """
> # WISH: Finish documenting the exceptions possibly raised by each function
> import logging
> import sys
> import time
> from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser, Namespace
> from functools import partial
> from logging import handlers
> from subprocess import PIPE, STDOUT, Popen, run
> from typing import Any, Callable, Iterable, Optional, Tuple, Union
>
> from notmuch import Database, Message
>
> # https://wiki.archlinux.org/index.php/Desktop_notifications#Python
> import gi # isort:skip
> gi.require_version('Notify', '0.7')
> # pylint: disable=wrong-import-position
> from gi.repository import Notify # noqa: E402 isort:skip
>
> Tags = Union[str, Iterable[str]]
>
> LOG = logging.getLogger(__name__)
> FILTER_ACTIONS = {'spam', 'general', 'notify'}
>
> # Defaults for command-line options
> BF_HAM = '.bf_ham'
> BF_SPAM = '.bf_spam'
> USER_SPAM = 'spam'
> TMP = '_simplemuch_tmp'
>
>
> class SimplemuchError(Exception):
>     """Base class for simplemuch exception classes"""
>
>
> class NotmuchDatabaseNeedsUpgradeError(SimplemuchError):
>     """needs_upgrade() returned True."""
>
>
> # WISH Capture more information, e.g. return code and command line
> class BogofilterError(SimplemuchError):
>     """Error from Bogofilter"""
>
>
> # def teste_mypy(i: int) -> None:
> # return i + ''
>
> def alert(summary: str,
>           body: str,
>           *args: Any,
>           fun: Callable[..., None] = LOG.warning) -> None:
>     """Show desktop notification -- `summary', `body' -- and log.
>
>     Logs with fun(body, *args).
>     """
>     if fun in (LOG.exception, LOG.error):
>         kwargs = {'icon': 'dialog-error'}
>     elif fun in (LOG.warn, LOG.warning):
>         kwargs = {'icon': 'dialog-warning'}
>     else:
>         kwargs = {}
>     Notify.Notification.new(summary, body % args, **kwargs).show()
>     fun(body, *args)
>
>
> def safe_open_db_rw() -> Database:
>     """Open Notmuch database for reading and writing and return it.
>
>     Before returning, check if the database needs upgrade; if so, raise
>     NotmuchDatabaseNeedsUpgradeError.
>     """
>     nm_db = Database(mode=Database.MODE.READ_WRITE)
>     if nm_db.needs_upgrade():
>         raise NotmuchDatabaseNeedsUpgradeError(
>             'Notmuch database needs upgrade. Exiting without action.\n'
>             'WISH Implement correct database upgrading')
>     return nm_db
>
>
> def update(nm_db: Database, args: Namespace, query: str,
>            opr: str) -> Tuple[int, float]:
>     """Call bogofilter on messages matching `query', change their tags.
>
>     Call `bogofilter' with command-line option `opr' (plus -bl) and feed
>     it (via stdin) the filenames of messages matching Notmuch query
>     `query'. For each such message, apply the corresponding tag change
>     (according to `args.bf_spam' and `args.bf_ham'). `opr' must be in
>     set('SsNn') (see bogofilter(1) for the meaning). Return the number
>     of messages operated on and the elapsed time.
>
>     This function is potentially affected by environment variables,
>     files and directories that affect Bogofilter or Notmuch.
>
>     TODO Handle bogofilter errors
>     """
>     start = time.time()
>     assert opr in set('SsNn')
>     tag_ = args.bf_spam if opr in 'sS' else args.bf_ham
>     if opr in 'sn':
>         def tag(msg: Message) -> None:
>             msg.add_tag(tag_)
>     else:
>         def tag(msg: Message) -> None:
>             msg.remove_tag(tag_)
>     num = 0
>     with Popen(('bogofilter', '-bl' + opr), stdin=PIPE, text=True,
>                bufsize=1) as bogo:
>         assert bogo.stdin       # Placate mypy
>         for msg in nm_db.create_query(query).search_messages():
>             bogo.stdin.write(msg.get_filename() + '\n')
>             tag(msg)
>             num += 1
>     if bogo.returncode:
>         raise BogofilterError(f'Bogofilter returned {bogo.returncode}')
>     return num, time.time() - start
>
>
> def train(args: Namespace) -> None:
>     """Train Bogofilter on the Notmuch database.
>
>     According to how the user classified the given message (spam or
>     ham), update Simplemuch tags (`args.bf_spam' and `args.bf_ham') and
>     Bogofilter's database. We assume the user classified a message as
>     spam if it is tagged `args.user_spam'; and he classified it as ham
>     if it has been read but not tagged `args.user_spam'.
>
>     Therefore we assume that:
>     1. Messages tagged `args.user_spam' are in fact spam.
>     2. Spammy read messages are tagged `args.user_spam'.
>     3. Messages tagged `args.bf_spam' are also tagged `args.user_spam',
>        unless they are false positives.
>
>     A problematic scenario is when the user reads spam in webmail but
>     forgets to tag it as spam in Notmuch.
>
>     This function is potentially affected by environment variables,
>     files and directories that affect Bogofilter or Notmuch.
>     """
>     with safe_open_db_rw() as nm_db:
>         def train_(query: str, opr: str, obj: str) -> None:
>             assert opr in set('SsNn')
>             opr_ = 'Register' if opr in 'sn' else 'Unregister'
>             end = f'{opr_}ed %d {obj} in %.3gs'
>             LOG.info('%s %s', opr_, obj)
>             num, dur = update(nm_db, args, query, opr)
>             LOG.info(end, num, dur)
>
>         bf_spam, bf_ham, user_spam = args.bf_spam, args.bf_ham, args.user_spam
>         train_(f'is:{user_spam} NOT is:{bf_spam}', 's', 'spam')
>         train_(f'is:{bf_spam} NOT is:{user_spam}', 'S', '(false) spam')
>         train_(f'NOT (is:{user_spam} is:unread is:{bf_ham})', 'n', 'ham')
>         train_(f'is:{user_spam} AND is:{bf_ham}', 'N', '(false) ham')
>
>
> def count(nm_db: Database, query: str, exclude: Tags = ()) -> int:
>     """Return Xapian’s best guess as to how many messages match `query'.
>
>     `exclude', if provided, must contain tags to exclude from the count
>     by default. A given tag will not be excluded if it appears
>     explicitly in `query'.
>
>     May raise:
>     - `NullPointerError' if the query creation failed (e.g. too little
>       memory).
>     - `NotInitializedError' if the underlying db was not initialized.
>
>     This function is potentially affected by environment variables,
>     files and directories that affect Notmuch.
>
>     WISH Find out and document what "best guess" means; this wording is
>     from the documentation of notmuch Python bindings.
>     """
>     query_ = nm_db.create_query(query)
>     if isinstance(exclude, str):
>         query_.exclude_tag(exclude)
>     else:
>         for tag in exclude:
>             query_.exclude_tag(tag)
>     return query_.count_messages()
>
>
> def filter_spam(nm_db: Database, query: str, ham: Optional[str] = None,
>                 spam: Optional[str] = None) -> None:
>     """Filter (Bogofilter) the messages matching Notmuch query `query'.
>
>     If Bogofilter classifies a given message as Spam/Ham then tag it
>     `spam'/`ham' (defaulting to `BF_SPAM'/`BF_HAM').
>
>     This function is potentially affected by environment variables,
>     files and directories that affect Bogofilter or Notmuch.
>     """
>     tag = dict(H=ham or BF_HAM, S=spam or BF_SPAM)
>     with Popen(('bogofilter', '-blT'), stdin=PIPE, stdout=PIPE, text=True,
>                bufsize=1) as bogo:
>         assert bogo.stdin and bogo.stdout  # Placate mypy
>         for msg in nm_db.create_query(query).search_messages():
>             bogo.stdin.write(msg.get_filename() + '\n')
>             code = bogo.stdout.readline().split()[-2]
>             if code != 'U':
>                 msg.add_tag(tag[code])
>                 msg_id = msg.get_message_id()
>                 LOG.debug('Message %s marked %s', msg_id, tag[code])
>
>
> def tag_search(nm_db: Database, query: str, add: Tags = (),
>                remove: Tags = ()) -> None:
>     """Add/remove tags from messages matching Notmuch `query'.
>
>     `nm_db' must be open for reading and writing. `query' should be a
>     Notmuch query on whose results we should act. Operate atomically on
>     the set of messages matching `query'.
>
>     May raise:
>     - `XapianError' – see documentation of `begin_atomic()' and
>       `end_atomic()' methods of `Database'
>     - `NullPointerError' if notmuch query creation failed (e.g. too
>       little memory) or `search_messages()' failed
>     - `NotInitializedError' if the underlying db was not initialized
>     - `NullPointerError' if a given tag is NULL
>     - `TagTooLongError' if the length of a given tag exceeds
>       notmuch.Message.NOTMUCH_TAG_MAX
>     - `ReadOnlyDatabaseError' if the database was opened in read-only
>       mode
>     - `NotInitializedError' if message has not been initialized
>
>     This function is potentially affected by environment variables,
>     files and directories that affect Notmuch.
>     """
>     nm_db.begin_atomic()
>     for msg in nm_db.create_query(query).search_messages():
>         if isinstance(add, str):
>             msg.add_tag(add)
>         else:
>             for tag in add:
>                 msg.add_tag(tag)
>         if isinstance(remove, str):
>             msg.remove_tag(remove)
>         else:
>             for tag in remove:
>                 msg.remove_tag(tag)
>     nm_db.end_atomic()
>
>
> def filter_notify(args: Namespace) -> None:
>     """Filter mail (afew, Bogofilter and Notmuch) and notify.
>
>     - `args.req' must be a container with elements of FILTER_ACTIONS we
>       should act on (requested actions).
>     - If \"args.req['spam']\" is True then `args.query' must be a string
>       representing a Notmuch query (on whose results the spam filter
>       will work) and `args.bf_ham', `args.bf_spam' must be the tags to
>       add to messages that Bogofilter classifies as ham (resp. spam).
>     - If `args.req' includes 'notify', we internally use a temporary tag
>       – args.tmp – that we assume not to preexist in the Notmuch database.
>
>     This function is potentially affected by environment variables,
>     files and directories that affect afew, Bogofilter or Notmuch.
>
>     TODO Document the required Notmuch saved queries.
>     TODO Document the DKIM filtering.
>     """
>     if args.req['general'] or args.req['notify'] or args.req['spam']:
>         with safe_open_db_rw() as nm_db:
>             if args.req['spam']:
>                 filter_spam(nm_db, args.query, args.bf_ham, args.bf_spam)
>             if args.req['general'] or args.req['notify']:
>                 # Afew will remove 'new'
>                 tag_search(nm_db, 'is:new', args.tmp)
>                 tmp_count = count(nm_db, f'is:{args.tmp}')
>     pipe = partial(run, stdout=PIPE, text=True)
>     try:
>         if args.req['general'] or args.req['notify']:
>             exclude = pipe(
>                 ('notmuch', 'config', 'get', 'search.exclude_tags'),
>                 check=True).stdout.rstrip('\n').split('\n')
>         if args.req['general']:
>             afew = pipe(('afew', '-tnv'), check=True, stderr=STDOUT)
>             LOG.info('\n%s', afew.stdout)
>             exclude_dkim = '(%s)' % ' OR '.join(
>                 (f'is:{tag}' for tag in exclude + ['/dkim-.*/']))
>             # problem = ('1584638185559.1b10c882-e1e1-4993-8f01-bdbcb3b4afe2@'
>             #            '302036m.grancursosonline.com.br')
>             dkim_query = f'(is:{args.tmp} -{exclude_dkim})'
>             afew = pipe(('afew', '-tv', '-eDKIMValidityFilter', dkim_query),
>                         stderr=STDOUT)
>             if afew.returncode:
>                 alert('DKIM filter error',
>                       'afew DKIMValidityFilter returned %d:\n%s',
>                       afew.returncode, afew.stdout)
>             else:
>                 LOG.info('\n%s', afew.stdout)
>         if args.req['general'] or args.req['notify']:
>             with safe_open_db_rw() as nm_db:
>                 if args.req['notify']:
>                     p_count = partial(count, nm_db, exclude=exclude)
>                     tmp_notify = f'is:{args.tmp} query:simplemuch_notify'
>                     notify = p_count(tmp_notify)
>                     if notify:
>                         unread = p_count("query:simplemuch_unread")
>                         inbox_unread = p_count("query:simplemuch_INBOX_unread")
>                         flagged = p_count("query:simplemuch_flagged")
>                         body = (f'\
> Inbox: {unread} unread ({inbox_unread} INBOX), {flagged} flagged\n' +
>                                 '\n'.join(msg.get_header('Subject') for msg in
>                                           nm_db.create_query(
>                                               tmp_notify).search_messages()))
>                         summary = f'{notify} new messages.'
>                         Notify.Notification.new(
>                             summary, body, 'mail-message-new').show()
>                 tag_search(nm_db, 'is:' + args.tmp, remove=args.tmp)
>                 tmp_count = 0
>     finally:
>         if (args.req['notify'] or args.req['general']) and tmp_count:
>             body_fmt = '%d messages left tagged %s'
>             alert('Dirty messages', body_fmt, tmp_count, args.tmp)
>
>
> # Commented out since I don't know a simple way to obtain the location of the
> # Bogofilter directory. It may not be `(~/.bogofilter)': see Bogofilter
> # man page section `ENVIRONMENT'. Maybe the `-x' flag can help.
> # def clean(db, args):
> #     """Remove Bogofilter tags from all messages and remove `(~/.bogofilter)'"""
> #     if not shutil.rmtree.avoids_symlink_attacks:
> #         print("Warning: this `shutil.rmtree' is susceptible to symlink attacks.")
> #     while True:
> #         reply = input(prompt=
> # f"""Remove Bogofilter database directory and, from all Notmuch email messages,
> # {args.bf_spam} and {args.bf_ham} tags? [y/N] """).lower()
> #         if 'no'.startswith(reply):
> #             return False
> #         if 'yes'.startswith(reply):
> #             shutil.rmtree(os.path.expanduser('~/.bogofilter'))
> #             tag_search(db, f'is:{args.bf_spam}', remove=f'{args.bf_spam}')
> #             tag_search(db, f'is:{args.bf_ham}', remove=f'{args.bf_ham}')
> #             return True
> #         print(
> #             'Please provide a valid answer: "yes", "no" or a prefix, '
> #             'case-insensitive', file=sys.stderr)
>
> def parse_command_line() -> Namespace:
>     """Parse sys.argv into a Namespace object"""
>     parser = ArgumentParser(
>         description=__doc__,
>         formatter_class=ArgumentDefaultsHelpFormatter)
>     parser.add_argument(
>         '--version', action='version', version='Simplemuch alpha')
>     parser.add_argument('-v', '--verbose', action='store_true',
>                         help='Output log messages also to stderr')
>     parser.add_argument(
>         '--bf_spam', default=BF_SPAM, metavar='TAG',
>         help='Tag for bogofilter-flagged spam')
>     parser.add_argument(
>         '--bf_ham', default=BF_HAM, metavar='TAG',
>         help='Tag for bogofilter-flagged ham')
>     parser.add_argument(
>         '--user_spam', default=USER_SPAM, metavar='TAG',
>         help='Tag for user-flagged spam')
>     parser.add_argument(
>         '--loglevel', default='INFO', help="""\
> Severity threshold for logging; logging messages less severe are discarded.
> For the allowed values see
> https://docs.python.org/3/howto/logging.html""")
>     subparsers = parser.add_subparsers(
>         title='Subcommands', required=True, description='Specify exactly one')
>     parser_filter = subparsers.add_parser(
>         'filter', help="""Filter mail. By default (see `--skip'), filter out
>         spam, then do general mail filtering (with afew) and then, depending on
>         the new messages, notify.""")
>     parser_filter.add_argument(
>         '--skip', choices=FILTER_ACTIONS, nargs='+', help='Actions to skip',
>         default=())
>     # WISH: append a random suffix
>     parser_filter.add_argument(
>         '--tmp', metavar='TEMPORARY_TAG', default=TMP,
>         help='Temporary tag for internal use; assumed by this script'
>         ' not to preexist in the database')
>     parser_filter.add_argument(
>         'query', nargs='?', default='is:new',
>         help='The Notmuch query whose result will be spam-filtered')
>     parser_filter.set_defaults(func=filter_notify)
>     parser_train = subparsers.add_parser(
>         'train', help="""Train bogofilter. We assume the user
>         classified a message as spam if it is tagged `args.user_spam';
>         and he classified a message as ham if it has been read but
>         not tagged `args.user_spam'. Therefore we assume that:
>
>         1. Messages tagged `args.user_spam' are in fact spam.
>         2. Spammy read messages are tagged `args.user_spam'.
>         3. Messages tagged `args.bf_spam' are also tagged `args.user_spam',
>         unless they are false positives.
>
>         A problematic scenario is when the user reads spam in webmail
>         but forgets to tag it spam in Notmuch.""")
>     parser_train.set_defaults(func=train)
>     # parser_clean = subparsers.add_parser(
>     #     'clean',
>     #     help="Remove Bogofilter tags from all messages and remove "
>     #     "`(~/.bogofilter)'")
>     # parser_clean.set_defaults(func=clean)
>     args = parser.parse_args()
>     args.req = {a: args.func is filter_notify and a not in args.skip
>                 for a in FILTER_ACTIONS}  # Requested actions
>     return args
>
>
> def main() -> None:
>     """Run as script: set up logging, parse sys.argv, execute."""
>     # WISH Maybe change the type of socket. See SysLogHandler documentation
>     handler1 = handlers.SysLogHandler(
>         address='/dev/log', facility=handlers.SysLogHandler.LOG_MAIL)
>     formatter = logging.Formatter(
>         '%(module)s[%(process)d].%(funcName)s: %(levelname)s: %(message)s')
>     handler1.setFormatter(formatter)
>     LOG.addHandler(handler1)
>     try:
>         args = parse_command_line()
>     # https://www.python.org/dev/peps/pep-0008/#programming-recommendations
>     except:                     # noqa: E722
>         LOG.exception(
>             'Exception occurred while parsing command line ("%s")', sys.argv)
>         raise
>     try:
>         if args.verbose:
>             handler2 = logging.StreamHandler()
>             handler2.setFormatter(formatter)
>             LOG.addHandler(handler2)
>         level_num = getattr(logging, args.loglevel.upper(), None)
>         if not isinstance(level_num, int):
>             raise ValueError('Invalid log level: %s' % args.loglevel)
>         LOG.setLevel(level_num)
>         if args.req['notify']:
>             # WISH Compute name from sys.argv[0], like argparse?
>             Notify.init('Simplemuch')
>         args.func(args)
>     except:                     # noqa: E722
>         alert('Exception occurred', 'Command line: "%s"; parsed: %s', sys.argv,
>               args, fun=LOG.exception)
>         raise
>
>
> if __name__ == '__main__':
>     main()
>
> # Local Variables:
> # ispell-local-dictionary: "en_US"
> # End:
>
> --
> - <https://jorgemorais.gitlab.io/justice-for-rms/>
> - If an email of mine arrives at your spam box, please notify me.
> - Please adopt free/libre formats like PDF, ODF, Org, LaTeX, Opus, WebM and 7z.
> - Free/libre software for Replicant, LineageOS and Android: https://f-droid.org
> - [[https://www.gnu.org/philosophy/free-sw.html][What is free software?]]