unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* [PATCH v3 0/5] Use Xapian query syntax for batch-tag dump/restore
@ 2012-12-28 18:26 Austin Clements
  2012-12-28 18:26 ` [PATCH v3 1/5] util: Factor out boolean term quoting routine Austin Clements
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Austin Clements @ 2012-12-28 18:26 UTC (permalink / raw)
  To: notmuch

This obsoletes

  id:1356493723-11085-1-git-send-email-amdragon@mit.edu

This version improves the documentation comment for make_boolean_term
and hopefully simplifies parse_boolean_term (though in a somewhat
different way than David suggested).

The diff relative to v2 follows

diff --git a/util/string-util.c b/util/string-util.c
index db01b4b..83b4953 100644
--- a/util/string-util.c
+++ b/util/string-util.c
@@ -112,11 +112,12 @@ parse_boolean_term (void *ctx, const char *str,
 
     /* Implement de-quoting compatible with make_boolean_term. */
     if (*pos == '"') {
-	char *out = talloc_strdup (ctx, pos + 1);
+	char *out = talloc_array (ctx, char, strlen (pos));
 	int closed = 0;
-	/* Find the closing quote and un-double doubled internal
-	 * quotes. */
-	for (pos = *term_out = out; *pos; ) {
+	*term_out = out;
+	/* Skip the opening quote, find the closing quote, and
+	 * un-double doubled internal quotes. */
+	for (++pos; *pos; ) {
 	    if (*pos == '"') {
 		++pos;
 		if (*pos != '"') {
@@ -133,12 +134,15 @@ parse_boolean_term (void *ctx, const char *str,
 	    goto FAIL;
 	*out = '\0';
     } else {
-	*term_out = talloc_strdup (ctx, pos);
+	const char *start = pos;
 	/* Check for text after the boolean term. */
 	while (*pos > ' ' && *pos != ')')
 	    ++pos;
 	if (*pos)
 	    goto FAIL;
+	/* No trailing text; dup the string so the caller can free
+	 * it. */
+	*term_out = talloc_strdup (ctx, start);
     }
     return 0;
 
diff --git a/util/string-util.h b/util/string-util.h
index aff2d65..43d49d0 100644
--- a/util/string-util.h
+++ b/util/string-util.h
@@ -20,7 +20,12 @@
 char *strtok_len (char *s, const char *delim, size_t *len);
 
 /* Construct a boolean term query with the specified prefix (e.g.,
- * "id") and search term, quoting term as necessary.
+ * "id") and search term, quoting term as necessary.  Specifically, if
+ * term contains any non-printable ASCII characters, non-ASCII
+ * characters, close parenthesis or double quotes, it will be enclosed
+ * in double quotes and any internal double quotes will be doubled
+ * (e.g. a"b -> "a""b").  The result will be a valid notmuch query and
+ * can be parsed by parse_boolean_term.
  *
  * Output is into buf; it may be talloc_realloced.
  * Return: 0 on success, non-zero on memory allocation failure.

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 1/5] util: Factor out boolean term quoting routine
  2012-12-28 18:26 [PATCH v3 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
@ 2012-12-28 18:26 ` Austin Clements
  2012-12-28 18:26 ` [PATCH v3 2/5] util: Function to parse boolean term queries Austin Clements
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2012-12-28 18:26 UTC (permalink / raw)
  To: notmuch

From: Austin Clements <amdragon@MIT.EDU>

This is now a generic boolean term quoting function.  It performs
minimal quoting to produce user-friendly queries.

This could live in tag-util as well, but it is really nothing specific
to tags (although the conventions are specific to Xapian).

The API is changed from "caller-allocates" to "readline-like".  The
scan for max tag length is pushed down into the quoting routine.
Furthermore, this now combines the term prefix with the quoted term;
arguably this is just as easy to do in the caller, but this will
nicely parallel the boolean term parsing function to be introduced
shortly.

This is an amalgamation of code written by David Bremner and myself.
---
 notmuch-tag.c      |   48 ++++++++++++---------------------------
 util/string-util.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 util/string-util.h |   14 ++++++++++++
 3 files changed, 92 insertions(+), 34 deletions(-)

diff --git a/notmuch-tag.c b/notmuch-tag.c
index 88d559b..fc9d43a 100644
--- a/notmuch-tag.c
+++ b/notmuch-tag.c
@@ -19,6 +19,7 @@
  */
 
 #include "notmuch-client.h"
+#include "string-util.h"
 
 static volatile sig_atomic_t interrupted;
 
@@ -35,25 +36,6 @@ handle_sigint (unused (int sig))
     interrupted = 1;
 }
 
-static char *
-_escape_tag (char *buf, const char *tag)
-{
-    const char *in = tag;
-    char *out = buf;
-
-    /* Boolean terms surrounded by double quotes can contain any
-     * character.  Double quotes are quoted by doubling them. */
-    *out++ = '"';
-    while (*in) {
-	if (*in == '"')
-	    *out++ = '"';
-	*out++ = *in++;
-    }
-    *out++ = '"';
-    *out = 0;
-    return buf;
-}
-
 typedef struct {
     const char *tag;
     notmuch_bool_t remove;
@@ -71,25 +53,16 @@ _optimize_tag_query (void *ctx, const char *orig_query_string,
      * parenthesize and the exclusion part of the query must not use
      * the '-' operator (though the NOT operator is fine). */
 
-    char *escaped, *query_string;
+    char *escaped = NULL;
+    size_t escaped_len = 0;
+    char *query_string;
     const char *join = "";
-    int i;
-    unsigned int max_tag_len = 0;
+    size_t i;
 
     /* Don't optimize if there are no tag changes. */
     if (tag_ops[0].tag == NULL)
 	return talloc_strdup (ctx, orig_query_string);
 
-    /* Allocate a buffer for escaping tags.  This is large enough to
-     * hold a fully escaped tag with every character doubled plus
-     * enclosing quotes and a NUL. */
-    for (i = 0; tag_ops[i].tag; i++)
-	if (strlen (tag_ops[i].tag) > max_tag_len)
-	    max_tag_len = strlen (tag_ops[i].tag);
-    escaped = talloc_array (ctx, char, max_tag_len * 2 + 3);
-    if (! escaped)
-	return NULL;
-
     /* Build the new query string */
     if (strcmp (orig_query_string, "*") == 0)
 	query_string = talloc_strdup (ctx, "(");
@@ -97,10 +70,17 @@ _optimize_tag_query (void *ctx, const char *orig_query_string,
 	query_string = talloc_asprintf (ctx, "( %s ) and (", orig_query_string);
 
     for (i = 0; tag_ops[i].tag && query_string; i++) {
+	/* XXX in case of OOM, query_string will be deallocated when
+	 * ctx is, which might be at shutdown */
+	if (make_boolean_term (ctx,
+			       "tag", tag_ops[i].tag,
+			       &escaped, &escaped_len))
+	    return NULL;
+
 	query_string = talloc_asprintf_append_buffer (
-	    query_string, "%s%stag:%s", join,
+	    query_string, "%s%s%s", join,
 	    tag_ops[i].remove ? "" : "not ",
-	    _escape_tag (escaped, tag_ops[i].tag));
+	    escaped);
 	join = " or ";
     }
 
diff --git a/util/string-util.c b/util/string-util.c
index 44f8cd3..e4bea21 100644
--- a/util/string-util.c
+++ b/util/string-util.c
@@ -20,6 +20,7 @@
 
 
 #include "string-util.h"
+#include "talloc.h"
 
 char *
 strtok_len (char *s, const char *delim, size_t *len)
@@ -32,3 +33,66 @@ strtok_len (char *s, const char *delim, size_t *len)
 
     return *len ? s : NULL;
 }
+
+int
+make_boolean_term (void *ctx, const char *prefix, const char *term,
+		   char **buf, size_t *len)
+{
+    const char *in;
+    char *out;
+    size_t needed = 3;
+    int need_quoting = 0;
+
+    /* Do we need quoting?  To be paranoid, we quote anything
+     * containing a quote, even though it only matters at the
+     * beginning, and anything containing non-ASCII text. */
+    for (in = term; *in && !need_quoting; in++)
+	if (*in <= ' ' || *in == ')' || *in == '"' || (unsigned char)*in > 127)
+	    need_quoting = 1;
+
+    if (need_quoting)
+	for (in = term; *in; in++)
+	    needed += (*in == '"') ? 2 : 1;
+    else
+	needed = strlen (term) + 1;
+
+    /* Reserve space for the prefix */
+    if (prefix)
+	needed += strlen (prefix) + 1;
+
+    if ((*buf == NULL) || (needed > *len)) {
+	*len = 2 * needed;
+	*buf = talloc_realloc (ctx, *buf, char, *len);
+    }
+
+    if (! *buf)
+	return 1;
+
+    out = *buf;
+
+    /* Copy in the prefix */
+    if (prefix) {
+	strcpy (out, prefix);
+	out += strlen (prefix);
+	*out++ = ':';
+    }
+
+    if (! need_quoting) {
+	strcpy (out, term);
+	return 0;
+    }
+
+    /* Quote term by enclosing it in double quotes and doubling any
+     * internal double quotes. */
+    *out++ = '"';
+    in = term;
+    while (*in) {
+	if (*in == '"')
+	    *out++ = '"';
+	*out++ = *in++;
+    }
+    *out++ = '"';
+    *out = '\0';
+
+    return 0;
+}
diff --git a/util/string-util.h b/util/string-util.h
index ac7676c..b8844a3 100644
--- a/util/string-util.h
+++ b/util/string-util.h
@@ -19,4 +19,18 @@
 
 char *strtok_len (char *s, const char *delim, size_t *len);
 
+/* Construct a boolean term query with the specified prefix (e.g.,
+ * "id") and search term, quoting term as necessary.  Specifically, if
+ * term contains any non-printable ASCII characters, non-ASCII
+ * characters, close parenthesis or double quotes, it will be enclosed
+ * in double quotes and any internal double quotes will be doubled
+ * (e.g. a"b -> "a""b").  The result will be a valid notmuch query and
+ * can be parsed by parse_boolean_term.
+ *
+ * Output is into buf; it may be talloc_realloced.
+ * Return: 0 on success, non-zero on memory allocation failure.
+ */
+int make_boolean_term (void *talloc_ctx, const char *prefix, const char *term,
+		       char **buf, size_t *len);
+
 #endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 2/5] util: Function to parse boolean term queries
  2012-12-28 18:26 [PATCH v3 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
  2012-12-28 18:26 ` [PATCH v3 1/5] util: Factor out boolean term quoting routine Austin Clements
@ 2012-12-28 18:26 ` Austin Clements
  2012-12-29 16:50   ` Mark Walters
  2012-12-29 21:58   ` Tomi Ollila
  2012-12-28 18:26 ` [PATCH v3 3/5] dump: Disallow \n in message IDs Austin Clements
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 9+ messages in thread
From: Austin Clements @ 2012-12-28 18:26 UTC (permalink / raw)
  To: notmuch

This parses the subset of Xapian's boolean term quoting rules that are
used by make_boolean_term.  This is provided as a generic string
utility, but will be used shortly in notmuch restore to parse and
optimize for ID queries.
---
 util/string-util.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 util/string-util.h |   11 +++++++++++
 2 files changed, 66 insertions(+)

diff --git a/util/string-util.c b/util/string-util.c
index e4bea21..83b4953 100644
--- a/util/string-util.c
+++ b/util/string-util.c
@@ -96,3 +96,58 @@ make_boolean_term (void *ctx, const char *prefix, const char *term,
 
     return 0;
 }
+
+int
+parse_boolean_term (void *ctx, const char *str,
+		    char **prefix_out, char **term_out)
+{
+    *prefix_out = *term_out = NULL;
+
+    /* Parse prefix */
+    const char *pos = strchr (str, ':');
+    if (! pos)
+	goto FAIL;
+    *prefix_out = talloc_strndup (ctx, str, pos - str);
+    ++pos;
+
+    /* Implement de-quoting compatible with make_boolean_term. */
+    if (*pos == '"') {
+	char *out = talloc_array (ctx, char, strlen (pos));
+	int closed = 0;
+	*term_out = out;
+	/* Skip the opening quote, find the closing quote, and
+	 * un-double doubled internal quotes. */
+	for (++pos; *pos; ) {
+	    if (*pos == '"') {
+		++pos;
+		if (*pos != '"') {
+		    /* Found the closing quote. */
+		    closed = 1;
+		    break;
+		}
+	    }
+	    *out++ = *pos++;
+	}
+	/* Did the term terminate without a closing quote or is there
+	 * trailing text after the closing quote? */
+	if (!closed || *pos)
+	    goto FAIL;
+	*out = '\0';
+    } else {
+	const char *start = pos;
+	/* Check for text after the boolean term. */
+	while (*pos > ' ' && *pos != ')')
+	    ++pos;
+	if (*pos)
+	    goto FAIL;
+	/* No trailing text; dup the string so the caller can free
+	 * it. */
+	*term_out = talloc_strdup (ctx, start);
+    }
+    return 0;
+
+ FAIL:
+    talloc_free (*prefix_out);
+    talloc_free (*term_out);
+    return 1;
+}
diff --git a/util/string-util.h b/util/string-util.h
index b8844a3..43d49d0 100644
--- a/util/string-util.h
+++ b/util/string-util.h
@@ -33,4 +33,15 @@ char *strtok_len (char *s, const char *delim, size_t *len);
 int make_boolean_term (void *talloc_ctx, const char *prefix, const char *term,
 		       char **buf, size_t *len);
 
+/* Parse a boolean term query produced by make_boolean_term, returning
+ * the prefix in *prefix_out and the term in *term_out.  *prefix_out
+ * and *term_out will be talloc'd with context ctx.
+ *
+ * Return: 0 on success, non-zero on parse error (including trailing
+ * data in str).
+ */
+int
+parse_boolean_term (void *ctx, const char *str,
+		    char **prefix_out, char **term_out);
+
 #endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 3/5] dump: Disallow \n in message IDs
  2012-12-28 18:26 [PATCH v3 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
  2012-12-28 18:26 ` [PATCH v3 1/5] util: Factor out boolean term quoting routine Austin Clements
  2012-12-28 18:26 ` [PATCH v3 2/5] util: Function to parse boolean term queries Austin Clements
@ 2012-12-28 18:26 ` Austin Clements
  2012-12-28 18:26 ` [PATCH v3 4/5] dump/restore: Use Xapian queries for batch-tag format Austin Clements
  2012-12-28 18:26 ` [PATCH v3 5/5] man: Update notmuch-dump(1) and notmuch-restore(1) Austin Clements
  4 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2012-12-28 18:26 UTC (permalink / raw)
  To: notmuch

When we switch to using regular Xapian queries in the dump format, \n
will cause problems, so we disallow it.  Specially, while Xapian can
quote and parse queries containing \n without difficultly, quoted
queries containing \n still span multiple lines, which breaks the
line-orientedness of the dump format.  Strictly speaking, we could
still round-trip these, but it would significantly complicate restore
as well as scripts that deal with tag dumps.  This complexity would
come at absolutely no benefit: because of the RFC 2822 unfolding
rules, no amount of standards negligence can produce a message with a
message ID containing a line break (not even Outlook can do it!).

Hence, we simply disallow it.
---
 notmuch-dump.c       |    9 +++++++++
 test/random-corpus.c |    4 +++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/notmuch-dump.c b/notmuch-dump.c
index d2dad40..29d79da 100644
--- a/notmuch-dump.c
+++ b/notmuch-dump.c
@@ -132,6 +132,15 @@ notmuch_dump_command (unused (void *ctx), int argc, char *argv[])
 	if (output_format == DUMP_FORMAT_SUP) {
 	    fputs (")\n", output);
 	} else {
+	    if (strchr (message_id, '\n')) {
+		/* This will produce a line break in the output, which
+		 * would be difficult to handle in tools.  However,
+		 * it's also impossible to produce an email containing
+		 * a line break in a message ID because of unfolding,
+		 * so we can safely disallow it. */
+		fprintf (stderr, "Error: cannot dump message id containing line break: %s\n", message_id);
+		return 1;
+	    }
 	    if (hex_encode (notmuch, message_id,
 			    &buffer, &buffer_size) != HEX_SUCCESS) {
 		    fprintf (stderr, "Error: failed to hex-encode msg-id %s\n",
diff --git a/test/random-corpus.c b/test/random-corpus.c
index f354d4b..8b7748e 100644
--- a/test/random-corpus.c
+++ b/test/random-corpus.c
@@ -96,7 +96,9 @@ random_utf8_string (void *ctx, size_t char_count)
 	    buf = talloc_realloc (ctx, buf, gchar, buf_size);
 	}
 
-	randomchar = random_unichar ();
+	do {
+	    randomchar = random_unichar ();
+	} while (randomchar == '\n');
 
 	written = g_unichar_to_utf8 (randomchar, buf + offset);
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 4/5] dump/restore: Use Xapian queries for batch-tag format
  2012-12-28 18:26 [PATCH v3 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
                   ` (2 preceding siblings ...)
  2012-12-28 18:26 ` [PATCH v3 3/5] dump: Disallow \n in message IDs Austin Clements
@ 2012-12-28 18:26 ` Austin Clements
  2012-12-28 18:26 ` [PATCH v3 5/5] man: Update notmuch-dump(1) and notmuch-restore(1) Austin Clements
  4 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2012-12-28 18:26 UTC (permalink / raw)
  To: notmuch

This switches the new batch-tag format away from using a home-grown
hex-encoding scheme for message IDs in the dump to simply using Xapian
queries with Xapian quoting syntax.

This has a variety of advantages beyond presenting a cleaner and more
consistent interface.  Foremost is that it will dramatically simplify
the quoting for batch tagging, which shares the same input format.
While the hex-encoding is no better or worse for the simple ID queries
used by dump/restore, it becomes onerous for general-purpose queries
used in batch tagging.  It also better handles strange cases like
"id:foo and bar", since this is no longer syntactically valid.
---
 notmuch-dump.c    |    9 +++++----
 notmuch-restore.c |   22 ++++++++++------------
 tag-util.c        |    6 ------
 test/dump-restore |   14 ++++++++------
 4 files changed, 23 insertions(+), 28 deletions(-)

diff --git a/notmuch-dump.c b/notmuch-dump.c
index 29d79da..bf01a39 100644
--- a/notmuch-dump.c
+++ b/notmuch-dump.c
@@ -20,6 +20,7 @@
 
 #include "notmuch-client.h"
 #include "dump-restore-private.h"
+#include "string-util.h"
 
 int
 notmuch_dump_command (unused (void *ctx), int argc, char *argv[])
@@ -141,13 +142,13 @@ notmuch_dump_command (unused (void *ctx), int argc, char *argv[])
 		fprintf (stderr, "Error: cannot dump message id containing line break: %s\n", message_id);
 		return 1;
 	    }
-	    if (hex_encode (notmuch, message_id,
-			    &buffer, &buffer_size) != HEX_SUCCESS) {
-		    fprintf (stderr, "Error: failed to hex-encode msg-id %s\n",
+	    if (make_boolean_term (notmuch, "id", message_id,
+				   &buffer, &buffer_size)) {
+		    fprintf (stderr, "Error: failed to quote message id %s\n",
 			     message_id);
 		    return 1;
 	    }
-	    fprintf (output, " -- id:%s\n", buffer);
+	    fprintf (output, " -- %s\n", buffer);
 	}
 
 	notmuch_message_destroy (message);
diff --git a/notmuch-restore.c b/notmuch-restore.c
index 9ed9b51..77a4c27 100644
--- a/notmuch-restore.c
+++ b/notmuch-restore.c
@@ -207,7 +207,7 @@ notmuch_restore_command (unused (void *ctx), int argc, char *argv[])
 	    INTERNAL_ERROR ("compile time constant regex failed.");
 
     do {
-	char *query_string;
+	char *query_string, *prefix, *term;
 
 	if (line_ctx != NULL)
 	    talloc_free (line_ctx);
@@ -220,19 +220,17 @@ notmuch_restore_command (unused (void *ctx), int argc, char *argv[])
 				  &query_string, tag_ops);
 
 	    if (ret == 0) {
-		if (strncmp ("id:", query_string, 3) != 0) {
-		    fprintf (stderr, "Warning: unsupported query: %s\n", query_string);
+		ret = parse_boolean_term (line_ctx, query_string,
+					  &prefix, &term);
+		if (ret) {
+		    fprintf (stderr, "Warning: cannot parse query: %s\n",
+			     query_string);
+		    continue;
+		} else if (strcmp ("id", prefix) != 0) {
+		    fprintf (stderr, "Warning: not an id query: %s\n", query_string);
 		    continue;
 		}
-		/* delete id: from front of string; tag_message
-		 * expects a raw message-id.
-		 *
-		 * XXX: Note that query string id:foo and bar will be
-		 * interpreted as a message id "foo and bar". This
-		 * should eventually be fixed to give a better error
-		 * message.
-		 */
-		query_string = query_string + 3;
+		query_string = term;
 	    }
 	}
 
diff --git a/tag-util.c b/tag-util.c
index 705b7ba..e4e5dda 100644
--- a/tag-util.c
+++ b/tag-util.c
@@ -124,12 +124,6 @@ parse_tag_line (void *ctx, char *line,
     }
 
     /* tok now points to the query string */
-    if (hex_decode_inplace (tok) != HEX_SUCCESS) {
-	ret = line_error (TAG_PARSE_INVALID, line_for_error,
-			  "hex decoding of query %s failed", tok);
-	goto DONE;
-    }
-
     *query_string = tok;
 
   DONE:
diff --git a/test/dump-restore b/test/dump-restore
index 6a989b6..f9ae5b3 100755
--- a/test/dump-restore
+++ b/test/dump-restore
@@ -195,23 +195,25 @@ a
 
 # the previous line was blank; also no yelling please
 +%zz -- id:whatever
-+e +f id:%yy
++e +f id:"
++e +f tag:abc
 # the next non-comment line should report an an empty tag error for
 # batch tagging, but not for restore
 + +e -- id:20091117232137.GA7669@griffis1.net
-# highlight the sketchy id parsing; this should be last
-+g -- id:foo and bar
+# valid id, but warning about missing message
++e id:missing_message_id
 EOF
 
 cat <<EOF > EXPECTED
-Warning: unsupported query: a
+Warning: cannot parse query: a
 Warning: no query string [+0]
 Warning: no query string [+a +b]
 Warning: missing query string [+a +b ]
 Warning: no query string after -- [+c +d --]
 Warning: hex decoding of tag %zz failed [+%zz -- id:whatever]
-Warning: hex decoding of query id:%yy failed [+e +f id:%yy]
-Warning: cannot apply tags to missing message: foo and bar
+Warning: cannot parse query: id:"
+Warning: not an id query: tag:abc
+Warning: cannot apply tags to missing message: missing_message_id
 EOF
 
 test_expect_equal_file EXPECTED OUTPUT
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 5/5] man: Update notmuch-dump(1) and notmuch-restore(1)
  2012-12-28 18:26 [PATCH v3 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
                   ` (3 preceding siblings ...)
  2012-12-28 18:26 ` [PATCH v3 4/5] dump/restore: Use Xapian queries for batch-tag format Austin Clements
@ 2012-12-28 18:26 ` Austin Clements
  4 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2012-12-28 18:26 UTC (permalink / raw)
  To: notmuch

Describe the new batch-tag format.  For notmuch-restore, rather than
half-heartedly duplicating the description, we now cite notmuch-dump.
---
 man/man1/notmuch-dump.1    |   11 +++++++----
 man/man1/notmuch-restore.1 |    6 ++----
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/man/man1/notmuch-dump.1 b/man/man1/notmuch-dump.1
index 770b00f..7bd6def 100644
--- a/man/man1/notmuch-dump.1
+++ b/man/man1/notmuch-dump.1
@@ -64,13 +64,16 @@ and tags containing whitespace or non-\fBascii\fR(7) characters.
 Each line has the form
 
 .RS 4
-.RI "+<" "encoded-tag" "> " "" "+<" "encoded-tag" "> ... -- " "" " id:<" encoded-message-id >
+.RI "+<" "encoded-tag" "> " "" "+<" "encoded-tag" "> ... -- " "" " id:<" quoted-message-id >
 
-where encoded means that every byte not matching the regex
+Tags are hex-encoded by replacing every byte not matching the regex
 .B [A-Za-z0-9@=.,_+-]
-is replace by
+with
 .B %nn
-where nn is the two digit hex encoding.
+where nn is the two digit hex encoding.  The message ID is a valid Xapian
+query, quoted using Xapian boolean term quoting rules: if the ID contains
+whitespace or a close paren or starts with a double quote, it must be
+enclosed in double quotes and double quotes inside the ID must be doubled.
 The astute reader will notice this is a special case of the batch input
 format for \fBnotmuch-tag\fR(1); note that the single message-id query is
 mandatory for \fBnotmuch-restore\fR(1).
diff --git a/man/man1/notmuch-restore.1 b/man/man1/notmuch-restore.1
index 6bba628..78fef52 100644
--- a/man/man1/notmuch-restore.1
+++ b/man/man1/notmuch-restore.1
@@ -57,10 +57,8 @@ sup calls them).
 The
 .B batch-tag
 dump format is intended to more robust against malformed message-ids
-and tags containing whitespace or non-\fBascii\fR(7) characters.  This
-format hex-escapes all characters those outside of a small character
-set, intended to be suitable for e.g. pathnames in most UNIX-like
-systems.
+and tags containing whitespace or non-\fBascii\fR(7) characters.  See
+\fBnotmuch-dump\fR(1) for details on this format.
 
 .B "notmuch restore"
 updates the maildir flags according to tag changes if the
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/5] util: Function to parse boolean term queries
  2012-12-28 18:26 ` [PATCH v3 2/5] util: Function to parse boolean term queries Austin Clements
@ 2012-12-29 16:50   ` Mark Walters
  2012-12-29 21:58   ` Tomi Ollila
  1 sibling, 0 replies; 9+ messages in thread
From: Mark Walters @ 2012-12-29 16:50 UTC (permalink / raw)
  To: Austin Clements, notmuch


On Fri, 28 Dec 2012, Austin Clements <amdragon@MIT.EDU> wrote:
> This parses the subset of Xapian's boolean term quoting rules that are
> used by make_boolean_term.  This is provided as a generic string
> utility, but will be used shortly in notmuch restore to parse and
> optimize for ID queries.
> ---
>  util/string-util.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  util/string-util.h |   11 +++++++++++
>  2 files changed, 66 insertions(+)
>
> diff --git a/util/string-util.c b/util/string-util.c
> index e4bea21..83b4953 100644
> --- a/util/string-util.c
> +++ b/util/string-util.c
> @@ -96,3 +96,58 @@ make_boolean_term (void *ctx, const char *prefix, const char *term,
>  
>      return 0;
>  }
> +
> +int
> +parse_boolean_term (void *ctx, const char *str,
> +		    char **prefix_out, char **term_out)
> +{
> +    *prefix_out = *term_out = NULL;
> +
> +    /* Parse prefix */
> +    const char *pos = strchr (str, ':');
> +    if (! pos)
> +	goto FAIL;
> +    *prefix_out = talloc_strndup (ctx, str, pos - str);
> +    ++pos;
> +
> +    /* Implement de-quoting compatible with make_boolean_term. */
> +    if (*pos == '"') {
> +	char *out = talloc_array (ctx, char, strlen (pos));
> +	int closed = 0;
> +	*term_out = out;
> +	/* Skip the opening quote, find the closing quote, and
> +	 * un-double doubled internal quotes. */
> +	for (++pos; *pos; ) {
> +	    if (*pos == '"') {
> +		++pos;
> +		if (*pos != '"') {
> +		    /* Found the closing quote. */
> +		    closed = 1;
> +		    break;
> +		}
> +	    }
> +	    *out++ = *pos++;
> +	}
> +	/* Did the term terminate without a closing quote or is there
> +	 * trailing text after the closing quote? */
> +	if (!closed || *pos)
> +	    goto FAIL;
> +	*out = '\0';
> +    } else {
> +	const char *start = pos;
> +	/* Check for text after the boolean term. */
> +	while (*pos > ' ' && *pos != ')')
> +	    ++pos;
> +	if (*pos)
> +	    goto FAIL;
> +	/* No trailing text; dup the string so the caller can free
> +	 * it. */
> +	*term_out = talloc_strdup (ctx, start);
> +    }
> +    return 0;

This looks like it will fail if a line contains trailing white space
(and that seemed to be the case from my experiment). Is that wanted?

If yes we may want to make the error message clearer as I got an error
of the form
Warning: cannot parse query: id:message-id@example.com   
and the fact that line contained trailing spaces was not clear.

> +
> + FAIL:
> +    talloc_free (*prefix_out);
> +    talloc_free (*term_out);
> +    return 1;
> +}
> diff --git a/util/string-util.h b/util/string-util.h
> index b8844a3..43d49d0 100644
> --- a/util/string-util.h
> +++ b/util/string-util.h
> @@ -33,4 +33,15 @@ char *strtok_len (char *s, const char *delim, size_t *len);
>  int make_boolean_term (void *talloc_ctx, const char *prefix, const char *term,
>  		       char **buf, size_t *len);
>  
> +/* Parse a boolean term query produced by make_boolean_term, returning
> + * the prefix in *prefix_out and the term in *term_out.  *prefix_out
> + * and *term_out will be talloc'd with context ctx.
> + *
> + * Return: 0 on success, non-zero on parse error (including trailing
> + * data in str).
> + */

I think it would be worth detailing what requirements a boolean term
here should meet to be parsed.  In particular, since make_boolean_term
is more restrictive than Xapian we should say whether this is like
make_boolean_term, like Xapian or somewhere in between.


Best wishes

Mark


> +int
> +parse_boolean_term (void *ctx, const char *str,
> +		    char **prefix_out, char **term_out);
> +
>  #endif
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/5] util: Function to parse boolean term queries
  2012-12-28 18:26 ` [PATCH v3 2/5] util: Function to parse boolean term queries Austin Clements
  2012-12-29 16:50   ` Mark Walters
@ 2012-12-29 21:58   ` Tomi Ollila
  2012-12-31  5:52     ` Austin Clements
  1 sibling, 1 reply; 9+ messages in thread
From: Tomi Ollila @ 2012-12-29 21:58 UTC (permalink / raw)
  To: Austin Clements, notmuch

On Fri, Dec 28 2012, Austin Clements <amdragon@MIT.EDU> wrote:

> This parses the subset of Xapian's boolean term quoting rules that are
> used by make_boolean_term.  This is provided as a generic string
> utility, but will be used shortly in notmuch restore to parse and
> optimize for ID queries.
> ---
>  util/string-util.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  util/string-util.h |   11 +++++++++++
>  2 files changed, 66 insertions(+)
>
> diff --git a/util/string-util.c b/util/string-util.c
> index e4bea21..83b4953 100644
> --- a/util/string-util.c
> +++ b/util/string-util.c
> @@ -96,3 +96,58 @@ make_boolean_term (void *ctx, const char *prefix, const char *term,
>  
>      return 0;
>  }
> +
> +int
> +parse_boolean_term (void *ctx, const char *str,
> +		    char **prefix_out, char **term_out)
> +{
> +    *prefix_out = *term_out = NULL;
> +
> +    /* Parse prefix */
> +    const char *pos = strchr (str, ':');
> +    if (! pos)
> +	goto FAIL;
> +    *prefix_out = talloc_strndup (ctx, str, pos - str);
> +    ++pos;
> +
> +    /* Implement de-quoting compatible with make_boolean_term. */
> +    if (*pos == '"') {
> +	char *out = talloc_array (ctx, char, strlen (pos));
> +	int closed = 0;
> +	*term_out = out;
> +	/* Skip the opening quote, find the closing quote, and
> +	 * un-double doubled internal quotes. */
> +	for (++pos; *pos; ) {
> +	    if (*pos == '"') {
> +		++pos;
> +		if (*pos != '"') {
> +		    /* Found the closing quote. */
> +		    closed = 1;
> +		    break;
> +		}
> +	    }
> +	    *out++ = *pos++;
> +	}
> +	/* Did the term terminate without a closing quote or is there
> +	 * trailing text after the closing quote? */
> +	if (!closed || *pos)
> +	    goto FAIL;
> +	*out = '\0';
> +    } else {
> +	const char *start = pos;
> +	/* Check for text after the boolean term. */
> +	while (*pos > ' ' && *pos != ')')
> +	    ++pos;
> +	if (*pos)
> +	    goto FAIL;

Mark pointed out a good case about trailing whitespace -- It would be nice
if the core were lenient for such cases. I personally remember once wasting
hours of work by just failing to notice trailing whitespace in one system
so this subject is sensitive to me...

Another thing I saw earlyer today: make_boolean_term() checks

   if (*in <= ' ' || *in == ')' || *in == '"' || (unsigned char)*in > 127)

but here the check is only

   while (*pos > ' ' && *pos != ')')

I wonder whether it matters...

Everyting else looks good to me.


Tomi

> +	/* No trailing text; dup the string so the caller can free
> +	 * it. */
> +	*term_out = talloc_strdup (ctx, start);
> +    }
> +    return 0;
> +
> + FAIL:
> +    talloc_free (*prefix_out);
> +    talloc_free (*term_out);
> +    return 1;
> +}
> diff --git a/util/string-util.h b/util/string-util.h
> index b8844a3..43d49d0 100644
> --- a/util/string-util.h
> +++ b/util/string-util.h
> @@ -33,4 +33,15 @@ char *strtok_len (char *s, const char *delim, size_t *len);
>  int make_boolean_term (void *talloc_ctx, const char *prefix, const char *term,
>  		       char **buf, size_t *len);
>  
> +/* Parse a boolean term query produced by make_boolean_term, returning
> + * the prefix in *prefix_out and the term in *term_out.  *prefix_out
> + * and *term_out will be talloc'd with context ctx.
> + *
> + * Return: 0 on success, non-zero on parse error (including trailing
> + * data in str).
> + */
> +int
> +parse_boolean_term (void *ctx, const char *str,
> +		    char **prefix_out, char **term_out);
> +
>  #endif
> -- 
> 1.7.10.4
>
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/5] util: Function to parse boolean term queries
  2012-12-29 21:58   ` Tomi Ollila
@ 2012-12-31  5:52     ` Austin Clements
  0 siblings, 0 replies; 9+ messages in thread
From: Austin Clements @ 2012-12-31  5:52 UTC (permalink / raw)
  To: Tomi Ollila, notmuch

On Sat, 29 Dec 2012, Tomi Ollila <tomi.ollila@iki.fi> wrote:
> On Fri, Dec 28 2012, Austin Clements <amdragon@MIT.EDU> wrote:
>
>> This parses the subset of Xapian's boolean term quoting rules that are
>> used by make_boolean_term.  This is provided as a generic string
>> utility, but will be used shortly in notmuch restore to parse and
>> optimize for ID queries.
>> ---
>>  util/string-util.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  util/string-util.h |   11 +++++++++++
>>  2 files changed, 66 insertions(+)
>>
>> diff --git a/util/string-util.c b/util/string-util.c
>> index e4bea21..83b4953 100644
>> --- a/util/string-util.c
>> +++ b/util/string-util.c
>> @@ -96,3 +96,58 @@ make_boolean_term (void *ctx, const char *prefix, const char *term,
>>  
>>      return 0;
>>  }
>> +
>> +int
>> +parse_boolean_term (void *ctx, const char *str,
>> +		    char **prefix_out, char **term_out)
>> +{
>> +    *prefix_out = *term_out = NULL;
>> +
>> +    /* Parse prefix */
>> +    const char *pos = strchr (str, ':');
>> +    if (! pos)
>> +	goto FAIL;
>> +    *prefix_out = talloc_strndup (ctx, str, pos - str);
>> +    ++pos;
>> +
>> +    /* Implement de-quoting compatible with make_boolean_term. */
>> +    if (*pos == '"') {
>> +	char *out = talloc_array (ctx, char, strlen (pos));
>> +	int closed = 0;
>> +	*term_out = out;
>> +	/* Skip the opening quote, find the closing quote, and
>> +	 * un-double doubled internal quotes. */
>> +	for (++pos; *pos; ) {
>> +	    if (*pos == '"') {
>> +		++pos;
>> +		if (*pos != '"') {
>> +		    /* Found the closing quote. */
>> +		    closed = 1;
>> +		    break;
>> +		}
>> +	    }
>> +	    *out++ = *pos++;
>> +	}
>> +	/* Did the term terminate without a closing quote or is there
>> +	 * trailing text after the closing quote? */
>> +	if (!closed || *pos)
>> +	    goto FAIL;
>> +	*out = '\0';
>> +    } else {
>> +	const char *start = pos;
>> +	/* Check for text after the boolean term. */
>> +	while (*pos > ' ' && *pos != ')')
>> +	    ++pos;
>> +	if (*pos)
>> +	    goto FAIL;
>
> Mark pointed out a good case about trailing whitespace -- It would be nice
> if the core were lenient for such cases. I personally remember once wasting
> hours of work by just failing to notice trailing whitespace in one system
> so this subject is sensitive to me...

Will do.

> Another thing I saw earlyer today: make_boolean_term() checks
>
>    if (*in <= ' ' || *in == ')' || *in == '"' || (unsigned char)*in > 127)
>
> but here the check is only
>
>    while (*pos > ' ' && *pos != ')')
>
> I wonder whether it matters...

This is correct in a conservative way.  We quote things that we don't
strictly need to quote (for example, a double quote in the middle of a
term doesn't actually require the term to be quoted, but we do anyway),
but we look for the end of an unquoted term in exactly the same way
Xapian does.

> Everyting else looks good to me.
>
>
> Tomi
>
>> +	/* No trailing text; dup the string so the caller can free
>> +	 * it. */
>> +	*term_out = talloc_strdup (ctx, start);
>> +    }
>> +    return 0;
>> +
>> + FAIL:
>> +    talloc_free (*prefix_out);
>> +    talloc_free (*term_out);
>> +    return 1;
>> +}
>> diff --git a/util/string-util.h b/util/string-util.h
>> index b8844a3..43d49d0 100644
>> --- a/util/string-util.h
>> +++ b/util/string-util.h
>> @@ -33,4 +33,15 @@ char *strtok_len (char *s, const char *delim, size_t *len);
>>  int make_boolean_term (void *talloc_ctx, const char *prefix, const char *term,
>>  		       char **buf, size_t *len);
>>  
>> +/* Parse a boolean term query produced by make_boolean_term, returning
>> + * the prefix in *prefix_out and the term in *term_out.  *prefix_out
>> + * and *term_out will be talloc'd with context ctx.
>> + *
>> + * Return: 0 on success, non-zero on parse error (including trailing
>> + * data in str).
>> + */
>> +int
>> +parse_boolean_term (void *ctx, const char *str,
>> +		    char **prefix_out, char **term_out);
>> +
>>  #endif
>> -- 
>> 1.7.10.4
>>
>> _______________________________________________
>> notmuch mailing list
>> notmuch@notmuchmail.org
>> http://notmuchmail.org/mailman/listinfo/notmuch

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2012-12-31  5:52 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-28 18:26 [PATCH v3 0/5] Use Xapian query syntax for batch-tag dump/restore Austin Clements
2012-12-28 18:26 ` [PATCH v3 1/5] util: Factor out boolean term quoting routine Austin Clements
2012-12-28 18:26 ` [PATCH v3 2/5] util: Function to parse boolean term queries Austin Clements
2012-12-29 16:50   ` Mark Walters
2012-12-29 21:58   ` Tomi Ollila
2012-12-31  5:52     ` Austin Clements
2012-12-28 18:26 ` [PATCH v3 3/5] dump: Disallow \n in message IDs Austin Clements
2012-12-28 18:26 ` [PATCH v3 4/5] dump/restore: Use Xapian queries for batch-tag format Austin Clements
2012-12-28 18:26 ` [PATCH v3 5/5] man: Update notmuch-dump(1) and notmuch-restore(1) Austin Clements

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).