unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* [PATCH 0/5] Use Xapian query syntax for batch-tag dump/restore
@ 2012-12-25  5:57 Austin Clements
  2012-12-25  5:57 ` [PATCH 1/5] util: Factor out boolean term quoting routine Austin Clements
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Austin Clements @ 2012-12-25  5:57 UTC (permalink / raw)
  To: notmuch

This is a stab at tweaking the new batch-tag dump/restore format to be
more amenable to batch tagging.  Currently, the "query" part of each
line isn't really a Xapian query; it's a hex-encoded message ID
prefixed with "id:".  This is fine for the very limited case of
dump/restore, but it extends poorly to the general queries allowed by
batch tagging.  However, Xapian already has a perfectly good quoting
syntax.  This series switches the batch-tag format to use regular
Xapian queries, using only regular Xapian quoting syntax, so that we
don't have to invent yet another quoting syntax for batch tagging.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/5] util: Factor out boolean term quoting routine
  2012-12-25  5:57 [PATCH 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
@ 2012-12-25  5:57 ` Austin Clements
  2012-12-25 12:25   ` David Bremner
  2012-12-25  5:57 ` [PATCH 2/5] util: Function to parse boolean term queries Austin Clements
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Austin Clements @ 2012-12-25  5:57 UTC (permalink / raw)
  To: notmuch

From: Austin Clements <amdragon@MIT.EDU>

This is now a generic boolean term quoting function.  It performs
minimal quoting to produce user-friendly queries.

This could live in tag-util as well, but it is really nothing specific
to tags (although the conventions are specific to Xapian).

The API is changed from "caller-allocates" to "readline-like".  The
scan for max tag length is pushed down into the quoting routine.
Furthermore, this now combines the term prefix with the quoted term;
arguably this is just as easy to do in the caller, but this will
nicely parallel the boolean term parsing function to be introduced
shortly.

This is an amalgamation of code written by David Bremner and myself.
---
 notmuch-tag.c      |   48 ++++++++++++----------------------------
 util/string-util.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 util/string-util.h |    9 ++++++++
 3 files changed, 85 insertions(+), 34 deletions(-)

diff --git a/notmuch-tag.c b/notmuch-tag.c
index 88d559b..fc9d43a 100644
--- a/notmuch-tag.c
+++ b/notmuch-tag.c
@@ -19,6 +19,7 @@
  */
 
 #include "notmuch-client.h"
+#include "string-util.h"
 
 static volatile sig_atomic_t interrupted;
 
@@ -35,25 +36,6 @@ handle_sigint (unused (int sig))
     interrupted = 1;
 }
 
-static char *
-_escape_tag (char *buf, const char *tag)
-{
-    const char *in = tag;
-    char *out = buf;
-
-    /* Boolean terms surrounded by double quotes can contain any
-     * character.  Double quotes are quoted by doubling them. */
-    *out++ = '"';
-    while (*in) {
-	if (*in == '"')
-	    *out++ = '"';
-	*out++ = *in++;
-    }
-    *out++ = '"';
-    *out = 0;
-    return buf;
-}
-
 typedef struct {
     const char *tag;
     notmuch_bool_t remove;
@@ -71,25 +53,16 @@ _optimize_tag_query (void *ctx, const char *orig_query_string,
      * parenthesize and the exclusion part of the query must not use
      * the '-' operator (though the NOT operator is fine). */
 
-    char *escaped, *query_string;
+    char *escaped = NULL;
+    size_t escaped_len = 0;
+    char *query_string;
     const char *join = "";
-    int i;
-    unsigned int max_tag_len = 0;
+    size_t i;
 
     /* Don't optimize if there are no tag changes. */
     if (tag_ops[0].tag == NULL)
 	return talloc_strdup (ctx, orig_query_string);
 
-    /* Allocate a buffer for escaping tags.  This is large enough to
-     * hold a fully escaped tag with every character doubled plus
-     * enclosing quotes and a NUL. */
-    for (i = 0; tag_ops[i].tag; i++)
-	if (strlen (tag_ops[i].tag) > max_tag_len)
-	    max_tag_len = strlen (tag_ops[i].tag);
-    escaped = talloc_array (ctx, char, max_tag_len * 2 + 3);
-    if (! escaped)
-	return NULL;
-
     /* Build the new query string */
     if (strcmp (orig_query_string, "*") == 0)
 	query_string = talloc_strdup (ctx, "(");
@@ -97,10 +70,17 @@ _optimize_tag_query (void *ctx, const char *orig_query_string,
 	query_string = talloc_asprintf (ctx, "( %s ) and (", orig_query_string);
 
     for (i = 0; tag_ops[i].tag && query_string; i++) {
+	/* XXX in case of OOM, query_string will be deallocated when
+	 * ctx is, which might be at shutdown */
+	if (make_boolean_term (ctx,
+			       "tag", tag_ops[i].tag,
+			       &escaped, &escaped_len))
+	    return NULL;
+
 	query_string = talloc_asprintf_append_buffer (
-	    query_string, "%s%stag:%s", join,
+	    query_string, "%s%s%s", join,
 	    tag_ops[i].remove ? "" : "not ",
-	    _escape_tag (escaped, tag_ops[i].tag));
+	    escaped);
 	join = " or ";
     }
 
diff --git a/util/string-util.c b/util/string-util.c
index 44f8cd3..161a4dd 100644
--- a/util/string-util.c
+++ b/util/string-util.c
@@ -20,6 +20,7 @@
 
 
 #include "string-util.h"
+#include "talloc.h"
 
 char *
 strtok_len (char *s, const char *delim, size_t *len)
@@ -32,3 +33,64 @@ strtok_len (char *s, const char *delim, size_t *len)
 
     return *len ? s : NULL;
 }
+
+int
+make_boolean_term (void *ctx, const char *prefix, const char *term,
+		   char **buf, size_t *len)
+{
+    const char *in;
+    char *out;
+    size_t needed = 3;
+    int need_quoting = 0;
+
+    /* Do we need quoting? */
+    for (in = term; *in && !need_quoting; in++)
+	if (*in <= ' ' || *in == ')' || *in == '"')
+	    need_quoting = 1;
+
+    if (need_quoting)
+	for (in = term; *in; in++)
+	    needed += (*in == '"') ? 2 : 1;
+    else
+	needed = strlen (term) + 1;
+
+    /* Reserve space for the prefix */
+    if (prefix)
+	needed += strlen (prefix) + 1;
+
+    if ((*buf == NULL) || (needed > *len)) {
+	*len = 2 * needed;
+	*buf = talloc_realloc (ctx, *buf, char, *len);
+    }
+
+    if (! *buf)
+	return 1;
+
+    out = *buf;
+
+    /* Copy in the prefix */
+    if (prefix) {
+	strcpy (out, prefix);
+	out += strlen (prefix);
+	*out++ = ':';
+    }
+
+    if (! need_quoting) {
+	strcpy (out, term);
+	return 0;
+    }
+
+    /* Quote term by enclosing it in double quotes and doubling any
+     * internal double quotes. */
+    *out++ = '"';
+    in = term;
+    while (*in) {
+	if (*in == '"')
+	    *out++ = '"';
+	*out++ = *in++;
+    }
+    *out++ = '"';
+    *out = '\0';
+
+    return 0;
+}
diff --git a/util/string-util.h b/util/string-util.h
index ac7676c..7475e2c 100644
--- a/util/string-util.h
+++ b/util/string-util.h
@@ -19,4 +19,13 @@
 
 char *strtok_len (char *s, const char *delim, size_t *len);
 
+/* Construct a boolean term query with the specified prefix (e.g.,
+ * "id") and search term, quoting term as necessary.
+ *
+ * Output is into buf; it may be talloc_realloced.
+ * Return: 0 on success, non-zero on memory allocation failure.
+ */
+int make_boolean_term (void *talloc_ctx, const char *prefix, const char *term,
+		       char **buf, size_t *len);
+
 #endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/5] util: Function to parse boolean term queries
  2012-12-25  5:57 [PATCH 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
  2012-12-25  5:57 ` [PATCH 1/5] util: Factor out boolean term quoting routine Austin Clements
@ 2012-12-25  5:57 ` Austin Clements
  2012-12-25 14:18   ` David Bremner
  2012-12-25  5:57 ` [PATCH 3/5] dump: Disallow \n in message IDs Austin Clements
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Austin Clements @ 2012-12-25  5:57 UTC (permalink / raw)
  To: notmuch

This reproduces Xapian's parsing rules for boolean term queries.  This
is provided as a generic string utility, but will be used shortly in
notmuch restore to parse and optimize for ID queries.
---
 util/string-util.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 util/string-util.h |   11 +++++++++
 2 files changed, 74 insertions(+)

diff --git a/util/string-util.c b/util/string-util.c
index 161a4dd..eaa6c99 100644
--- a/util/string-util.c
+++ b/util/string-util.c
@@ -94,3 +94,66 @@ make_boolean_term (void *ctx, const char *prefix, const char *term,
 
     return 0;
 }
+
+static int
+consume_double_quote (const char **str)
+{
+    if (**str == '"') {
+	++*str;
+	return 1;
+    } else if (strncmp(*str, "\xe2\x80\x9c", 3) == 0 || /* UTF8 0x201c */
+	       strncmp(*str, "\xe2\x80\x9d", 3) == 0) { /* UTF8 0x201d */
+	*str += 3;
+	return 3;
+    } else {
+	return 0;
+    }
+}
+
+int
+parse_boolean_term (void *ctx, const char *str,
+		    char **prefix_out, char **term_out)
+{
+    *prefix_out = *term_out = NULL;
+
+    /* Parse prefix */
+    const char *pos = strchr (str, ':');
+    if (! pos)
+	goto FAIL;
+    *prefix_out = talloc_strndup (ctx, str, pos - str);
+    ++pos;
+
+    /* Implement Xapian's boolean term de-quoting.  This is a nearly
+     * direct translation of QueryParser::Internal::parse_query. */
+    pos = *term_out = talloc_strdup (ctx, pos);
+    if (consume_double_quote (&pos)) {
+	char *out = talloc_strdup (ctx, pos);
+	pos = *term_out = out;
+	while (1) {
+	    if (! *pos) {
+		/* Premature end of string */
+		goto FAIL;
+	    } else if (*pos == '"') {
+		if (*++pos != '"')
+		    break;
+	    } else if (consume_double_quote (&pos)) {
+		break;
+	    }
+	    *out++ = *pos++;
+	}
+	if (*pos)
+	    goto FAIL;
+	*out = '\0';
+    } else {
+	while (*pos > ' ' && *pos != ')')
+	    ++pos;
+	if (*pos)
+	    goto FAIL;
+    }
+    return 0;
+
+ FAIL:
+    talloc_free (*prefix_out);
+    talloc_free (*term_out);
+    return 1;
+}
diff --git a/util/string-util.h b/util/string-util.h
index 7475e2c..e4e4c42 100644
--- a/util/string-util.h
+++ b/util/string-util.h
@@ -28,4 +28,15 @@ char *strtok_len (char *s, const char *delim, size_t *len);
 int make_boolean_term (void *talloc_ctx, const char *prefix, const char *term,
 		       char **buf, size_t *len);
 
+/* Parse a boolean term query, returning the prefix in *prefix_out and
+ * the term in *term_out.  *prefix_out and *term_out will be talloc'd
+ * with context ctx.
+ *
+ * Return: 0 on success, non-zero on parse error (including trailing
+ * data in str).
+ */
+int
+parse_boolean_term (void *ctx, const char *str,
+		    char **prefix_out, char **term_out);
+
 #endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/5] dump: Disallow \n in message IDs
  2012-12-25  5:57 [PATCH 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
  2012-12-25  5:57 ` [PATCH 1/5] util: Factor out boolean term quoting routine Austin Clements
  2012-12-25  5:57 ` [PATCH 2/5] util: Function to parse boolean term queries Austin Clements
@ 2012-12-25  5:57 ` Austin Clements
  2012-12-25 14:21   ` David Bremner
  2012-12-25  5:57 ` [PATCH 4/5] dump/restore: Use Xapian queries for batch-tag format Austin Clements
  2012-12-25  5:57 ` [PATCH 5/5] man: Update notmuch-dump(1) for new " Austin Clements
  4 siblings, 1 reply; 14+ messages in thread
From: Austin Clements @ 2012-12-25  5:57 UTC (permalink / raw)
  To: notmuch

When we switch to using regular Xapian queries in the dump format, \n
will cause problems, so we disallow it.  Specially, while Xapian can
quote and parse queries containing \n without difficultly, quoted
queries containing \n still span multiple lines, which breaks the
line-orientedness of the dump format.  Strictly speaking, we could
still round-trip these, but it would significantly complicate restore
as well as scripts that deal with tag dumps.  This complexity would
come at absolutely no benefit: because of the RFC 2822 unfolding
rules, no amount of standards negligence can produce a message with a
message ID containing a line break (not even Outlook can do it!).

Hence, we simply disallow it.
---
 notmuch-dump.c       |    9 +++++++++
 test/random-corpus.c |    2 ++
 2 files changed, 11 insertions(+)

diff --git a/notmuch-dump.c b/notmuch-dump.c
index d2dad40..29d79da 100644
--- a/notmuch-dump.c
+++ b/notmuch-dump.c
@@ -132,6 +132,15 @@ notmuch_dump_command (unused (void *ctx), int argc, char *argv[])
 	if (output_format == DUMP_FORMAT_SUP) {
 	    fputs (")\n", output);
 	} else {
+	    if (strchr (message_id, '\n')) {
+		/* This will produce a line break in the output, which
+		 * would be difficult to handle in tools.  However,
+		 * it's also impossible to produce an email containing
+		 * a line break in a message ID because of unfolding,
+		 * so we can safely disallow it. */
+		fprintf (stderr, "Error: cannot dump message id containing line break: %s\n", message_id);
+		return 1;
+	    }
 	    if (hex_encode (notmuch, message_id,
 			    &buffer, &buffer_size) != HEX_SUCCESS) {
 		    fprintf (stderr, "Error: failed to hex-encode msg-id %s\n",
diff --git a/test/random-corpus.c b/test/random-corpus.c
index f354d4b..d0e3e8f 100644
--- a/test/random-corpus.c
+++ b/test/random-corpus.c
@@ -97,6 +97,8 @@ random_utf8_string (void *ctx, size_t char_count)
 	}
 
 	randomchar = random_unichar ();
+	if (randomchar == '\n')
+	    randomchar = 'x';
 
 	written = g_unichar_to_utf8 (randomchar, buf + offset);
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/5] dump/restore: Use Xapian queries for batch-tag format
  2012-12-25  5:57 [PATCH 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
                   ` (2 preceding siblings ...)
  2012-12-25  5:57 ` [PATCH 3/5] dump: Disallow \n in message IDs Austin Clements
@ 2012-12-25  5:57 ` Austin Clements
  2012-12-25  5:57 ` [PATCH 5/5] man: Update notmuch-dump(1) for new " Austin Clements
  4 siblings, 0 replies; 14+ messages in thread
From: Austin Clements @ 2012-12-25  5:57 UTC (permalink / raw)
  To: notmuch

This switches the new batch-tag format away from using a home-grown
hex-encoding scheme for message IDs in the dump to simply using Xapian
queries with Xapian quoting syntax.

This has a variety of advantages beyond presenting a cleaner and more
consistent interface.  Foremost is that it will dramatically simplify
the quoting for batch tagging, which shares the same input format.
While the hex-encoding is no better or worse for the simple ID queries
used by dump/restore, it becomes onerous for general-purpose queries
used in batch tagging.  It also better handles strange cases like
"id:foo and bar", since this is no longer syntactically valid.
---
 notmuch-dump.c    |    9 +++++----
 notmuch-restore.c |   22 ++++++++++------------
 tag-util.c        |    6 ------
 test/dump-restore |   11 +++++------
 4 files changed, 20 insertions(+), 28 deletions(-)

diff --git a/notmuch-dump.c b/notmuch-dump.c
index 29d79da..bf01a39 100644
--- a/notmuch-dump.c
+++ b/notmuch-dump.c
@@ -20,6 +20,7 @@
 
 #include "notmuch-client.h"
 #include "dump-restore-private.h"
+#include "string-util.h"
 
 int
 notmuch_dump_command (unused (void *ctx), int argc, char *argv[])
@@ -141,13 +142,13 @@ notmuch_dump_command (unused (void *ctx), int argc, char *argv[])
 		fprintf (stderr, "Error: cannot dump message id containing line break: %s\n", message_id);
 		return 1;
 	    }
-	    if (hex_encode (notmuch, message_id,
-			    &buffer, &buffer_size) != HEX_SUCCESS) {
-		    fprintf (stderr, "Error: failed to hex-encode msg-id %s\n",
+	    if (make_boolean_term (notmuch, "id", message_id,
+				   &buffer, &buffer_size)) {
+		    fprintf (stderr, "Error: failed to quote message id %s\n",
 			     message_id);
 		    return 1;
 	    }
-	    fprintf (output, " -- id:%s\n", buffer);
+	    fprintf (output, " -- %s\n", buffer);
 	}
 
 	notmuch_message_destroy (message);
diff --git a/notmuch-restore.c b/notmuch-restore.c
index 9ed9b51..77a4c27 100644
--- a/notmuch-restore.c
+++ b/notmuch-restore.c
@@ -207,7 +207,7 @@ notmuch_restore_command (unused (void *ctx), int argc, char *argv[])
 	    INTERNAL_ERROR ("compile time constant regex failed.");
 
     do {
-	char *query_string;
+	char *query_string, *prefix, *term;
 
 	if (line_ctx != NULL)
 	    talloc_free (line_ctx);
@@ -220,19 +220,17 @@ notmuch_restore_command (unused (void *ctx), int argc, char *argv[])
 				  &query_string, tag_ops);
 
 	    if (ret == 0) {
-		if (strncmp ("id:", query_string, 3) != 0) {
-		    fprintf (stderr, "Warning: unsupported query: %s\n", query_string);
+		ret = parse_boolean_term (line_ctx, query_string,
+					  &prefix, &term);
+		if (ret) {
+		    fprintf (stderr, "Warning: cannot parse query: %s\n",
+			     query_string);
+		    continue;
+		} else if (strcmp ("id", prefix) != 0) {
+		    fprintf (stderr, "Warning: not an id query: %s\n", query_string);
 		    continue;
 		}
-		/* delete id: from front of string; tag_message
-		 * expects a raw message-id.
-		 *
-		 * XXX: Note that query string id:foo and bar will be
-		 * interpreted as a message id "foo and bar". This
-		 * should eventually be fixed to give a better error
-		 * message.
-		 */
-		query_string = query_string + 3;
+		query_string = term;
 	    }
 	}
 
diff --git a/tag-util.c b/tag-util.c
index 705b7ba..e4e5dda 100644
--- a/tag-util.c
+++ b/tag-util.c
@@ -124,12 +124,6 @@ parse_tag_line (void *ctx, char *line,
     }
 
     /* tok now points to the query string */
-    if (hex_decode_inplace (tok) != HEX_SUCCESS) {
-	ret = line_error (TAG_PARSE_INVALID, line_for_error,
-			  "hex decoding of query %s failed", tok);
-	goto DONE;
-    }
-
     *query_string = tok;
 
   DONE:
diff --git a/test/dump-restore b/test/dump-restore
index 6a989b6..aecc393 100755
--- a/test/dump-restore
+++ b/test/dump-restore
@@ -195,23 +195,22 @@ a
 
 # the previous line was blank; also no yelling please
 +%zz -- id:whatever
-+e +f id:%yy
++e +f id:"
++e +f tag:abc
 # the next non-comment line should report an an empty tag error for
 # batch tagging, but not for restore
 + +e -- id:20091117232137.GA7669@griffis1.net
-# highlight the sketchy id parsing; this should be last
-+g -- id:foo and bar
 EOF
 
 cat <<EOF > EXPECTED
-Warning: unsupported query: a
+Warning: cannot parse query: a
 Warning: no query string [+0]
 Warning: no query string [+a +b]
 Warning: missing query string [+a +b ]
 Warning: no query string after -- [+c +d --]
 Warning: hex decoding of tag %zz failed [+%zz -- id:whatever]
-Warning: hex decoding of query id:%yy failed [+e +f id:%yy]
-Warning: cannot apply tags to missing message: foo and bar
+Warning: cannot parse query: id:"
+Warning: not an id query: tag:abc
 EOF
 
 test_expect_equal_file EXPECTED OUTPUT
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/5] man: Update notmuch-dump(1) for new batch-tag format
  2012-12-25  5:57 [PATCH 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
                   ` (3 preceding siblings ...)
  2012-12-25  5:57 ` [PATCH 4/5] dump/restore: Use Xapian queries for batch-tag format Austin Clements
@ 2012-12-25  5:57 ` Austin Clements
  2012-12-25 14:47   ` David Bremner
  2012-12-25 18:05   ` David Bremner
  4 siblings, 2 replies; 14+ messages in thread
From: Austin Clements @ 2012-12-25  5:57 UTC (permalink / raw)
  To: notmuch

---
 man/man1/notmuch-dump.1 |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/man/man1/notmuch-dump.1 b/man/man1/notmuch-dump.1
index 770b00f..7bd6def 100644
--- a/man/man1/notmuch-dump.1
+++ b/man/man1/notmuch-dump.1
@@ -64,13 +64,16 @@ and tags containing whitespace or non-\fBascii\fR(7) characters.
 Each line has the form
 
 .RS 4
-.RI "+<" "encoded-tag" "> " "" "+<" "encoded-tag" "> ... -- " "" " id:<" encoded-message-id >
+.RI "+<" "encoded-tag" "> " "" "+<" "encoded-tag" "> ... -- " "" " id:<" quoted-message-id >
 
-where encoded means that every byte not matching the regex
+Tags are hex-encoded by replacing every byte not matching the regex
 .B [A-Za-z0-9@=.,_+-]
-is replace by
+with
 .B %nn
-where nn is the two digit hex encoding.
+where nn is the two digit hex encoding.  The message ID is a valid Xapian
+query, quoted using Xapian boolean term quoting rules: if the ID contains
+whitespace or a close paren or starts with a double quote, it must be
+enclosed in double quotes and double quotes inside the ID must be doubled.
 The astute reader will notice this is a special case of the batch input
 format for \fBnotmuch-tag\fR(1); note that the single message-id query is
 mandatory for \fBnotmuch-restore\fR(1).
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/5] util: Factor out boolean term quoting routine
  2012-12-25  5:57 ` [PATCH 1/5] util: Factor out boolean term quoting routine Austin Clements
@ 2012-12-25 12:25   ` David Bremner
  0 siblings, 0 replies; 14+ messages in thread
From: David Bremner @ 2012-12-25 12:25 UTC (permalink / raw)
  To: Austin Clements, notmuch

Austin Clements <amdragon@MIT.EDU> writes:
>
> This could live in tag-util as well, but it is really nothing specific
> to tags (although the conventions are specific to Xapian).
>
> Furthermore, this now combines the term prefix with the quoted term;
> arguably this is just as easy to do in the caller, but this will
> nicely parallel the boolean term parsing function to be introduced
> shortly.

At first glance, I found this a bit too notmuch-specific to go in
util. On second glance, I found my first reaction somewhat
bizarre. Perhaps at some point we should drop a README file in util
explaining what should go there.

Other than that, LGTM
d

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/5] util: Function to parse boolean term queries
  2012-12-25  5:57 ` [PATCH 2/5] util: Function to parse boolean term queries Austin Clements
@ 2012-12-25 14:18   ` David Bremner
  2012-12-25 14:34     ` David Bremner
  2012-12-26  1:23     ` Austin Clements
  0 siblings, 2 replies; 14+ messages in thread
From: David Bremner @ 2012-12-25 14:18 UTC (permalink / raw)
  To: Austin Clements, notmuch

Austin Clements <amdragon@MIT.EDU> writes:

> +    if (consume_double_quote (&pos)) {
> +	char *out = talloc_strdup (ctx, pos);
> +	pos = *term_out = out;
> +	while (1) {

Overall the control flow here is a bit tricky to follow. I'm not sure if
a real loop condition would help or make it worse.

> +	    if (! *pos) {
> +		/* Premature end of string */
> +		goto FAIL;
> +	    } else if (*pos == '"') {
> +		if (*++pos != '"')
> +		    break;
> +	    } else if (consume_double_quote (&pos)) {
> +		break;
> +	    }

I'm confused by the asymmetry here. Quoted strings can start with
unicode quotes, but internally can only have ascii '"'? Is this
documented somewhere?

> +    } else {
> +	while (*pos > ' ' && *pos != ')')
> +	    ++pos;
> +	if (*pos)
> +	    goto FAIL;
> +    }

So if there is no quote, we skip the part after the ':'? I guess I
probably missed something because that doesn't sound like the intended
behaviour.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/5] dump: Disallow \n in message IDs
  2012-12-25  5:57 ` [PATCH 3/5] dump: Disallow \n in message IDs Austin Clements
@ 2012-12-25 14:21   ` David Bremner
  0 siblings, 0 replies; 14+ messages in thread
From: David Bremner @ 2012-12-25 14:21 UTC (permalink / raw)
  To: Austin Clements, notmuch

Austin Clements <amdragon@MIT.EDU> writes:
>  
>  	randomchar = random_unichar ();
> +	if (randomchar == '\n')
> +	    randomchar = 'x';
>  

what about something like 

do {
   randomchar = random_unichar ();
}  while (randomchar == '\n');

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/5] util: Function to parse boolean term queries
  2012-12-25 14:18   ` David Bremner
@ 2012-12-25 14:34     ` David Bremner
  2012-12-26  1:23     ` Austin Clements
  1 sibling, 0 replies; 14+ messages in thread
From: David Bremner @ 2012-12-25 14:34 UTC (permalink / raw)
  To: Austin Clements, notmuch

David Bremner <david@tethera.net> writes:
>
> So if there is no quote, we skip the part after the ':'? I guess I
> probably missed something because that doesn't sound like the intended
> behaviour.

Indeed the following addition to the test shows it works fine in
context. So I guess I just don't follow this control flow very well.

diff --git a/test/dump-restore b/test/dump-restore
index aecc393..f9ae5b3 100755
--- a/test/dump-restore
+++ b/test/dump-restore
@@ -200,6 +200,8 @@ a
 # the next non-comment line should report an an empty tag error for
 # batch tagging, but not for restore
 + +e -- id:20091117232137.GA7669@griffis1.net
+# valid id, but warning about missing message
++e id:missing_message_id
 EOF
 
 cat <<EOF > EXPECTED
@@ -211,6 +213,7 @@ Warning: no query string after -- [+c +d --]
 Warning: hex decoding of tag %zz failed [+%zz -- id:whatever]
 Warning: cannot parse query: id:"
 Warning: not an id query: tag:abc
+Warning: cannot apply tags to missing message: missing_message_id
 EOF
 
 test_expect_equal_file EXPECTED OUTPUT

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/5] man: Update notmuch-dump(1) for new batch-tag format
  2012-12-25  5:57 ` [PATCH 5/5] man: Update notmuch-dump(1) for new " Austin Clements
@ 2012-12-25 14:47   ` David Bremner
  2012-12-25 15:18     ` David Bremner
  2012-12-25 18:05   ` David Bremner
  1 sibling, 1 reply; 14+ messages in thread
From: David Bremner @ 2012-12-25 14:47 UTC (permalink / raw)
  To: Austin Clements, notmuch


Overall I agree this is conceptually cleaner. Transforming between
quoting formats is inherently delicate; I suspect there will always be
at least one special case missed.  I did find the dequoting code a bit
baffling, but this is more of an implementation issue.

There is a certain cognitive dissonance in using hex-quoting for tags
and xapian-quoting for message-ids.  Did you think about the
implications of using xapian quoting for tags as well?

d

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/5] man: Update notmuch-dump(1) for new batch-tag format
  2012-12-25 14:47   ` David Bremner
@ 2012-12-25 15:18     ` David Bremner
  0 siblings, 0 replies; 14+ messages in thread
From: David Bremner @ 2012-12-25 15:18 UTC (permalink / raw)
  To: Austin Clements, notmuch

David Bremner <david@tethera.net> writes:
>
> There is a certain cognitive dissonance in using hex-quoting for tags
> and xapian-quoting for message-ids.  Did you think about the
> implications of using xapian quoting for tags as well?
>

At risk of talking to myself, I guess one loses the space delimited
lists of tags, which is probably pretty handy for scripting.

d

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/5] man: Update notmuch-dump(1) for new batch-tag format
  2012-12-25  5:57 ` [PATCH 5/5] man: Update notmuch-dump(1) for new " Austin Clements
  2012-12-25 14:47   ` David Bremner
@ 2012-12-25 18:05   ` David Bremner
  1 sibling, 0 replies; 14+ messages in thread
From: David Bremner @ 2012-12-25 18:05 UTC (permalink / raw)
  To: Austin Clements, notmuch

Austin Clements <amdragon@MIT.EDU> writes:

> ---
>  man/man1/notmuch-dump.1 |   11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>

I guess the short description in notmuch-restore.1 needs updating as
well.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/5] util: Function to parse boolean term queries
  2012-12-25 14:18   ` David Bremner
  2012-12-25 14:34     ` David Bremner
@ 2012-12-26  1:23     ` Austin Clements
  1 sibling, 0 replies; 14+ messages in thread
From: Austin Clements @ 2012-12-26  1:23 UTC (permalink / raw)
  To: David Bremner; +Cc: notmuch

Quoth David Bremner on Dec 25 at 10:18 am:
> Austin Clements <amdragon@MIT.EDU> writes:
> 
> > +    if (consume_double_quote (&pos)) {
> > +	char *out = talloc_strdup (ctx, pos);
> > +	pos = *term_out = out;
> > +	while (1) {
> 
> Overall the control flow here is a bit tricky to follow. I'm not sure if
> a real loop condition would help or make it worse.
> 
> > +	    if (! *pos) {
> > +		/* Premature end of string */
> > +		goto FAIL;
> > +	    } else if (*pos == '"') {
> > +		if (*++pos != '"')
> > +		    break;
> > +	    } else if (consume_double_quote (&pos)) {
> > +		break;
> > +	    }
> 
> I'm confused by the asymmetry here. Quoted strings can start with
> unicode quotes, but internally can only have ascii '"'? Is this
> documented somewhere?

Only in the source, to my knowledge.  Here's how Xapian parses these
things (where 'it' is a UTF8 string iterator):

if (it != end && is_double_quote(*it)) {
    // Quoted boolean term (can contain any character).
    ++it;
    while (it != end) {
	if (*it == '"') {
	    // Interpret "" as an escaped ".
	    if (++it == end || *it != '"')
		break;
	} else if (is_double_quote(*it)) {
	    ++it;
	    break;
	}
	Unicode::append_utf8(name, *it++);
    }
} else {
    // Can't boolean filter prefix a subexpression, so
    // just use anything following the prefix until the
    // next space or ')' as part of the boolean filter
    // term.
    while (it != end && *it > ' ' && *it != ')')
	Unicode::append_utf8(name, *it++);
}

> > +    } else {
> > +	while (*pos > ' ' && *pos != ')')
> > +	    ++pos;
> > +	if (*pos)
> > +	    goto FAIL;
> > +    }
> 
> So if there is no quote, we skip the part after the ':'? I guess I
> probably missed something because that doesn't sound like the intended
> behaviour.

This isn't skipping it; it's checking its well-formedness.  In this
case, *term_out already points to a correct string that can be used
literally; we just have to check that there's no trailing garbage
after the boolean query.

This is certainly worth commenting.

For the record, I also tried passing the query straight to the
library, without parsing it in the CLI (and simply checking that the
query returned exactly one result), and it was noticeably slower (the
restore performance test took between 3.82 and 5.25 seconds for the
code in this series and ~7.2 seconds using a general query.)

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-12-26  1:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-25  5:57 [PATCH 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
2012-12-25  5:57 ` [PATCH 1/5] util: Factor out boolean term quoting routine Austin Clements
2012-12-25 12:25   ` David Bremner
2012-12-25  5:57 ` [PATCH 2/5] util: Function to parse boolean term queries Austin Clements
2012-12-25 14:18   ` David Bremner
2012-12-25 14:34     ` David Bremner
2012-12-26  1:23     ` Austin Clements
2012-12-25  5:57 ` [PATCH 3/5] dump: Disallow \n in message IDs Austin Clements
2012-12-25 14:21   ` David Bremner
2012-12-25  5:57 ` [PATCH 4/5] dump/restore: Use Xapian queries for batch-tag format Austin Clements
2012-12-25  5:57 ` [PATCH 5/5] man: Update notmuch-dump(1) for new " Austin Clements
2012-12-25 14:47   ` David Bremner
2012-12-25 15:18     ` David Bremner
2012-12-25 18:05   ` David Bremner

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).