From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id 817326DE01D0
 for ; Thu, 2 Jun 2016 10:34:08 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -0.02
X-Spam-Level:
X-Spam-Status: No, score=-0.02 tagged_above=-999 required=5
 tests=[AWL=-0.020] autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id 8i_EcJL7dRda for ; Thu, 2 Jun 2016 10:34:00 -0700 (PDT)
Received: from che.mayfirst.org (che.mayfirst.org [162.247.75.118])
 by arlo.cworth.org (Postfix) with ESMTP id 6CEF96DE00DB
 for ; Thu, 2 Jun 2016 10:34:00 -0700 (PDT)
Received: from fifthhorseman.net (unknown [38.109.115.130])
 by che.mayfirst.org (Postfix) with ESMTPSA id 5AEBDF98B;
 Thu, 2 Jun 2016 13:33:58 -0400 (EDT)
Received: by fifthhorseman.net (Postfix, from userid 1000)
 id 3AE6020245; Thu, 2 Jun 2016 13:33:58 -0400 (EDT)
From: Daniel Kahn Gillmor
To: David Bremner , notmuch@notmuchmail.org
Subject: Re: [RFC2 Patch 5/5] lib: iterator API for message properties
In-Reply-To: <87lh2ofpxk.fsf@zancas.localnet>
References: <1463927339-5441-1-git-send-email-david@tethera.net>
 <1464608999-14774-1-git-send-email-david@tethera.net>
 <1464608999-14774-6-git-send-email-david@tethera.net>
 <8760tthfuy.fsf@zancas.localnet>
 <87pos1u14p.fsf@alice.fifthhorseman.net>
 <87eg8ht2sb.fsf@alice.fifthhorseman.net>
 <87lh2ofpxk.fsf@zancas.localnet>
User-Agent: Notmuch/0.22+16~g87b7bd4 (http://notmuchmail.org) Emacs/24.5.1
 (x86_64-pc-linux-gnu)
Date: Thu, 02 Jun 2016 13:33:54 -0400
Message-ID: <87inxrqyv1.fsf@alice.fifthhorseman.net>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
 micalg=pgp-sha512; protocol="application/pgp-signature"
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 02 Jun 2016 17:34:08 -0000

--=-=-=
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Hi Bremner--

thanks for the response! I didn't mean my post to be a wet-blanket,
just wanted to think through the tradeoffs...

On Wed 2016-06-01 19:29:59 -0400, David Bremner wrote:
> I guess if you don't care about the possibility of iterating all pairs
> with given key prefix (which I admit makes more sense for the config
> API), then the code could be simplified to look more like the tag list
> handling code. C is pretty crap at generics, but I guess looking at
> tags.c, it's really about iterators for notmuch_string_list_t. So it
> could probably be generalized to serve here.
>
> For each such prefix, one would need to roughly duplicate patches 1/5
> and 3/5. It took me a little while to figure 1/5 out, but now that I
> know, it would be less trouble. I guess my thinking here was that I
> would provide a low level interface that people using the C API or
> bindings could use without hacking xapian.

[...]

> XPROPERTY is an internal prefix, which means it isn't added to the query
> parser. As it happens, I didn't plan on CLI access to these terms
> either. Both of those choices are tradeoffs to say that these are
> internal metadata, suitable for manipulation by programs. Such programs
> could be scripts using python or ruby.

I think this makes sense, and makes me more comfortable with the overall
idea of this patch series. maybe it'd be useful to clearly document the
intended scope?

>> If we add new specific features, we could potentially augment the dump
>> format explicitly for them, without having the property abstraction.
>
> We could, but I think should change the dump format quite rarely, since
> we risk breaking people's scripts. So if we did it for one prefix, I'd
> like to do in an extensible way so that adding new prefixes is somewhat
> transparent.
> It also means some duplication of effort/code in notmuch
> dump/restore to dump/restore each new prefix.
>
> It's probably true that per-prefix dump format would be more compact,
> since the keys would be implicit, rather than repeated for every pair.

true, though i'm not sure how much compactness is necessary. presumably
people are compressing their dumpfiles, and regularly repeated strings
are the easiest thing to compress.

>> We already have some explicit features for each message (subject,
>> from, to, attachment, mimetype, thread id, etc), and most of them are
>> derived from the message itself, with the hope that it could be
>> re-derived given just the message body. Is there a distinction
>> between properties that can be derived from the message body and
>> properties that need to be additionally derived from some other data?
>
> As Tomi always says, naming is the hardest thing; properties is a bit
> generic. I'm not sure the distinction you make between the "message" and
> the "message body" here. I think most of our derived terms are from the
> message header. My intent here is that "properties" are used for things
> that cannot be derived from the message (header or body).

To be clear, i didn't mean to distinguish between "message" and "message
body" -- i don't think of the headers as being significantly different
from the body (and indeed, if we can get memoryhole working, then some
headers might be derived from or influenced by the body).

maybe it's worth thinking through each of these per-message features,
and where they come from -- are they from the message itself (header,
body, etc), from the message's position(s) in the filesystem, or
somewhere else entirely?
From the message:

 * message-id
 * subject
 * mimetype
 * attachment
 * references
 * from
 * to
 * replyto

From the filesystem itself:

 * filenames
 * folder

From elsewhere:

 * for messages which have multiple files, which file is actually
   indexed
 * thread-id
 * tag

we're now talking about adding properties, which are in the "elsewhere"
category, right?

It's worth noticing that the stuff in "elsewhere" is the stuff that
won't propagate across a dump/restore unless it's explicitly in the dump
somehow. We currently fail to restore thread-id and which file is
actually indexed across a dump/restore :/

> - per prefix requires new code in the library and dump/restore
>   for every prefix
> + the dump format might be more compact if done in a per prefix way.
> + this code would be simpler than the generic properties code,
>   mainly because it would not need key value pairs,
> - the library and dump/restore are parts of notmuch that have the
>   potential to "break the world". Not too many people are
>   comfortable hacking on them.
> - changing the dump format is something like an ABI change for
>   people whose scripts rely on dump / restore.

I think you've convinced me that it's good to go ahead with the
properties, assuming it's scoped as defined above.

I still think that we need a better story for upgrades to the dump
format in general, but maybe this isn't the place to make that
particular case.
--dkg

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQJ8BAEBCgBmBQJXUG4CXxSAAAAAAC4AKGlzc3Vlci1mcHJAbm90YXRpb25zLm9w
ZW5wZ3AuZmlmdGhob3JzZW1hbi5uZXRFREIyRTc0RjU2RkNGMkI2NzI5N0I3MzUy
NEVDRkY1QUZGNjgzNzBBAAoJECTs/1r/aDcKAYIP/0gctfr4FyAaVpazBvRPjyC0
BgAhDO7Wv3V1G88m4spGUcH5yVuWFPhZ+bQedcPD0pExloo3ax21dxlaNiS1/qVE
FtTMkxQbUXHVcDqoYeu4XNBKMng1KSbNJuQ2LHq4g/88ytEKXvcCz7qTbNd6tTQ+
LGlb101PdRJXtbU3MjLn86/Ehomt+AqqxYYDFMaRkDUEgaQPSzWe+H+V5nSWKZJG
xthvfzoElvAXM1sAKNPosBQI2s5k87vn43mXSrKfzNZ0OCTn+Wf4os17ARCEVKA5
PuecT1eXsfwF/R0rj7LPuPLlU3HvSKkPaL32SQsDTRwUgOqF6Cu1FMnzsL7aPXwf
I3wAzfsn4/x7XEfzD3Mot0LhFiS6Iahu7djEshuoxyfUtfHrOLpZy6qDWs5bLyp2
WFJx0zWP2hHoA0HqoabU+38riCiyii6Dq8Fo6qps1UGtX+2IsLVzFQq9599B8845
8J5EU6Upf1zDPcsRCrLoEP4ePtb2eAZzeCmR5ENSFlme+Z9xK66dwEAzxeW3j+3L
4eES+Ft6FXlh7ZMhFv0sssYcXcCdAn9Umvtvt0Q98d7J1/KLEX1HIl4WQJTkTEy8
Ni5x7Zf9hS/M/f4jAJH0M6i366RgI8hQUN9FL0/9Bfq4+pqV6ZjLTAwEiT1GQiUb
y3aH14NDyvKyhlF8Z4Ty
=uF1c
-----END PGP SIGNATURE-----
--=-=-=--