Extending the ecomplete.el data store.


* Extending the ecomplete.el data store.
@ 2018-02-04  6:16 Karl Fogel
  2018-02-04 22:33 ` Stefan Monnier
  2018-02-05  9:40 ` Extending the ecomplete.el data store Lars Ingebrigtsen
  0 siblings, 2 replies; 20+ messages in thread
From: Karl Fogel @ 2018-02-04  6:16 UTC (permalink / raw)
  To: Emacs Devel

This post's primary audience is Lars Ingebrigtsen -- we agreed to move this thread over here from Emacs Tangents [1] -- though of course anyone's welcome to join in.

Some context for everyone else: after I wrote mailaprop [2] to do prioritized autofill for email addresses, Lars mentioned his ecomplete.el, which is part of Emacs.  Ecomplete offers similar functionality, although its UI is minibuffer-based rather than tooltip-based, and it uses a different address prioritization algorithm from mailaprop.

Lars, I'd like to propose extending the data stored by ecomplete.el so that it supports the union of the data needed by ecomplete and that needed by mailaprop.  (Mostly what mailaprop stores is a superset of what ecomplete stores, with one exception; more on that below.)

An ecomplete record looks like this:

  (KEY  TIMES_USED  LAST_TIME_USED  STRING)

Here is an example (the `mail' at the front is so that you could have one alist of things for `mail' and another for, say, `twitter', etc):

  ((mail
   ("larsi@example.com" 381 1516109510 "Lars Ingebrigtsen <larsi@example.com>")
   ("kfogel@example.com" 10 1516065455 "Karl Fogel <kfogel@example.com>")
   ...
   ))
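
A minimal sketch of how a record in that shape can be looked up, assuming the database has already been read into a variable. The names `my-ecomplete-db' and `my-ecomplete-lookup' are hypothetical, chosen for illustration; they are not part of ecomplete.el.

```elisp
;; Hypothetical variable holding data shaped like the example above.
(defvar my-ecomplete-db
  '((mail
     ("larsi@example.com" 381 1516109510
      "Lars Ingebrigtsen <larsi@example.com>")
     ("kfogel@example.com" 10 1516065455
      "Karl Fogel <kfogel@example.com>"))))

(defun my-ecomplete-lookup (type key)
  "Return the (KEY TIMES_USED LAST_TIME_USED STRING) record for KEY under TYPE.
Return nil when TYPE or KEY is not present."
  (assoc key (cdr (assq type my-ecomplete-db))))

;; Individual fields are then plain list positions:
(nth 1 (my-ecomplete-lookup 'mail "larsi@example.com"))   ; TIMES_USED
(nth 3 (my-ecomplete-lookup 'mail "kfogel@example.com"))  ; STRING
```

The leading symbol selects one alist per record type, so a lookup under a type that has no entries (say, `twitter') simply returns nil.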

Meanwhile, a mailaprop on-disk record looks like this:

  (KEY
   ((VARIANT  LAST_TIME_USED  SENT_COUNT  RECEIVED_COUNT)
    ...))

Here's an example of a key with three variants:

  ("a.szymanowski@example.com"
   (("a.szymanowski@example.com"                       "2017 Jun 12"  29 31)
    ("A. Szymanowski <a.szymanowski@example.com>"      "2017 Sep 03"   1  0)
    ("Abilene Szymanowski <A.Szymanowski@example.com>" "2018 Jan 15"   8  7)))

Let's ignore the fact that ecomplete stores dates as seconds-since-epoch while mailaprop uses human-readable strings; I'd be happy to switch mailaprop to the ecomplete way for that.  We'll just focus on substantive differences here.

At the individual record level, the mailaprop information is a superset of the ecomplete information in two ways:

* Mailaprop remembers all the real-name variations and case variations individually, including case variations in the email address portion as well as in the real name portion.  So each variation gets its own record, but they're all tied together under the same case-folded KEY so they can be scored together.  (Contrast with ecomplete, where I believe `ecomplete-add-item' just remembers the most recently-seen variant for a given key.)

* Mailaprop splits TIMES_USED into SENT_COUNT and RECEIVED_COUNT: the number of times the user has sent mail to the address in question, and the number of times the user has received mail from that address.

At the next level up, ecomplete stores a piece of information that mailaprop does not:

* Ecomplete starts the alist with a symbol that offers the possibility of multiple types of records, e.g., `mail', `twitter', etc.

So, here's a proposal for a unified format that supports both packages -- this format is more verbose but more extensible:

  (KEY          ; string: downcased email addr
    ((VARIANT   ; string: case-preserving address w/ real name
       (TYPE                                     ; symbol: `mail', etc
         ('last-sent  LAST_TIME_SENT_TO)          ; int: seconds since epoch
         ('last-recv  LAST_TIME_RECEIVED_FROM)    ; int: seconds since epoch
         ('sent-count SENT_COUNT)                 ; int: total times sent
         ('recv-count RECEIVED_COUNT)             ; int: total times received
       )
       ...further TYPEs could go here...
     )
     ...further VARIANTs here...
    )
    ...[reserved, in case we ever need something other than VARIANTs]...
  )

That's the format for one record; the master record file is just a list of elements of the above type.
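
To make the template concrete, here is one record written out as data, with a small accessor. This is only a sketch under stated assumptions: the counts are borrowed from the mailaprop example earlier, the timestamps are made-up values (mailaprop currently stores a single date per variant), and the field names are written without the leading quote mark because the file would be `read' as data rather than evaluated. The names `my-unified-record' and `my-unified-field' are hypothetical.

```elisp
;; One record in the proposed shape: (KEY ((VARIANT (TYPE FIELDS...)) ...)).
;; Timestamps are invented for illustration only.
(defvar my-unified-record
  '("a.szymanowski@example.com"
    (("Abilene Szymanowski <A.Szymanowski@example.com>"
      (mail
       (last-sent  1516000000)
       (last-recv  1515900000)
       (sent-count 8)
       (recv-count 7)))
     ("a.szymanowski@example.com"
      (mail
       (last-sent  1497225600)
       (last-recv  1497139200)
       (sent-count 29)
       (recv-count 31))))))

(defun my-unified-field (record variant type field)
  "Return FIELD of TYPE for VARIANT in RECORD, or nil if any level is absent."
  (let* ((variants (nth 1 record))               ; list of (VARIANT (TYPE ...) ...)
         (types (cdr (assoc variant variants)))  ; alist keyed by TYPE symbol
         (fields (cdr (assq type types))))       ; alist keyed by field symbol
    (car (cdr (assq field fields)))))

(my-unified-field my-unified-record
                  "a.szymanowski@example.com" 'mail 'recv-count)
```

Because every level is an alist, a reader that does not know about a newer field or TYPE can skip it, which is where the extensibility comes from.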

This format offers many possibilities for creative scoring mechanisms, and is more easily extensible than either package's current format.

If we unify the format, we should probably unify on one default record file too.  Right now, `ecomplete-database-file' defaults to ~/.ecompleterc or ~/.emacs.d/ecompleterc, whereas `mailaprop-address-file' doesn't default to anything -- the user must set it manually: email addresses are pretty private, and I didn't want to guess about what locations would be confidential enough.  I'd be happy to just have mailaprop use ecomplete's defaults for the database file, though.  The privacy concern can be addressed with documentation.

Now, about database maintenance:

Mailaprop adds new addresses to the database using a different mechanism than ecomplete uses.  Mailaprop users run an asynchronous script that reads all of their email and generates the database.  Ecomplete watches email as it comes and goes in Emacs, and automagically keeps its database up-to-date.  (I don't think ecomplete has any way to "catch up to the present" when you start using it; you just start out with no email addresses, and it watches everything you do from then on.)

These two methods of database maintenance are basically compatible.  In fact, one could use mailaprop's script to generate the database the first time, and then depend on ecomplete to keep it up-to-date after that.  As long as we document what's going on, and each package uses its current defaults, I think we're fine.  Those who use ecomplete will still get what they've been getting, and those who use mailaprop can either use the mailaprop way of periodically updating the database, or they can ask ecomplete to maintain it in real time for them (this might necessitate a trivial flag in ecomplete to get it to maintain the database while not offering completion, for those who want a mailaprop-style popup-autofill UI, but that's easy to do).

I guess we would also switch to UTF-8 for the coding system for the database?  (Right now `ecomplete-database-file-coding-system' defaults to `iso-2022-7bit'.)
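
For anyone who wants to try UTF-8 before any default changes, a one-line init-file setting should be enough. This is a sketch; it assumes nothing beyond the variable named above, and it does not convert an existing database file that was written in another coding system.

```elisp
;; Read and write the ecomplete database as UTF-8.
(setq ecomplete-database-file-coding-system 'utf-8)
```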

Note that ecomplete would have to add code to convert the new on-disk format to the in-memory format that ecomplete currently uses.  That is, this function...

  (defun ecomplete-setup ()
    "Read the .ecompleterc file."
    (when (file-exists-p ecomplete-database-file)
      (with-temp-buffer
        (let ((coding-system-for-read ecomplete-database-file-coding-system))
          (insert-file-contents ecomplete-database-file)
          (setq ecomplete-database (read (current-buffer)))))))

...would need to be supplemented with something that does what `mailaprop-digest-raw-addresses' does in mailaprop, and the reverse for writing the data out.  Obviously, this proposed new format is pretty easily convertible to and from ecomplete's in-memory representation.
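
A rough sketch of what the read-side conversion could look like, collapsing one record of the proposed format into a (KEY TIMES_USED LAST_TIME_USED STRING) entry. The policy here is an assumption made for illustration: TIMES_USED is the sum of all sent and received counts, and STRING is the variant with the most recent activity. The function names are hypothetical, the sample timestamps are made up, and this is not what `mailaprop-digest-raw-addresses' does.

```elisp
(defun my-unified-variant-fields (variant type)
  "Return the field alist for TYPE in VARIANT, a (STRING (TYPE FIELDS...) ...) list."
  (cdr (assq type (cdr variant))))

(defun my-unified-variant-last-used (variant type)
  "Return the later of last-sent and last-recv for VARIANT under TYPE, or 0."
  (let ((fields (my-unified-variant-fields variant type)))
    (max (or (cadr (assq 'last-sent fields)) 0)
         (or (cadr (assq 'last-recv fields)) 0))))

(defun my-unified-variant-count (variant type)
  "Return sent-count plus recv-count for VARIANT under TYPE."
  (let ((fields (my-unified-variant-fields variant type)))
    (+ (or (cadr (assq 'sent-count fields)) 0)
       (or (cadr (assq 'recv-count fields)) 0))))

(defun my-unified-to-ecomplete (record type)
  "Collapse unified RECORD into (KEY TIMES_USED LAST_TIME_USED STRING) for TYPE."
  (let ((total 0) (latest 0) (string nil))
    (dolist (variant (nth 1 record))
      (let ((used (my-unified-variant-last-used variant type)))
        (setq total (+ total (my-unified-variant-count variant type)))
        ;; Keep the most recently used variant as the display string.
        (when (or (null string) (> used latest))
          (setq latest used
                string (car variant)))))
    (list (car record) total latest string)))

;; Sample input; timestamps are invented.
(my-unified-to-ecomplete
 '("a.szymanowski@example.com"
   (("a.szymanowski@example.com"
     (mail (last-sent 1497225600) (last-recv 1497139200)
           (sent-count 29) (recv-count 31)))
    ("Abilene Szymanowski <A.Szymanowski@example.com>"
     (mail (last-sent 1516000000) (last-recv 1515900000)
           (sent-count 8) (recv-count 7)))))
 'mail)
```

The reverse direction loses nothing that ecomplete has today, but it can only fill in one variant and one combined count, so a writer would need to decide how to merge with variants already on disk.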

Whew, okay, those are my thoughts.  I'm not sure whether it makes sense to unify the two packages themselves ever, but in any case using the same on-disk format would be a good move.

Modifications or counterproposals welcome of course, and it's also perfectly okay to say "Thanks, but this isn't worth the trouble." :-).  These two packages are so close in functionality and data that it seems a shame for them not to share a datastore, but we may just decide it's too much effort.  If we decide that, we should at least put pointers in each package mentioning the other, and this thread, so future programmers at least have their attention drawn to the redundancy before making further enhancements.

Best regards,
-Karl

[1] https://lists.gnu.org/archive/html/emacs-tangents/2018-01/msg00023.html

[2] https://lists.gnu.org/archive/html/emacs-tangents/2018-01/msg00003.html


* Re: Extending the ecomplete.el data store.
  2018-02-04  6:16 Extending the ecomplete.el data store Karl Fogel
@ 2018-02-04 22:33 ` Stefan Monnier
  2018-02-04 23:54   ` Karl Fogel
  2018-02-05  9:41   ` Lars Ingebrigtsen
  2018-02-05  9:40 ` Extending the ecomplete.el data store Lars Ingebrigtsen
  1 sibling, 2 replies; 20+ messages in thread
From: Stefan Monnier @ 2018-02-04 22:33 UTC (permalink / raw)
  To: emacs-devel

> Some context for everyone else: after I wrote mailaprop [2] to do
> prioritized autofill for email addresses, Lars mentioned his
> ecomplete.el, which is part of Emacs.  Ecomplete offers similar
> functionality, although its UI is minibuffer-based rather than
> tooltip-based, and it uses a different address prioritization
> algorithm from mailaprop.

FWIW, I recently installed a completion-table for ecomplete together
with an ecomplete completion-at-point-function for message.el, which
together let you use the ecomplete database for TAB completion as well
as for company-mode ("tooltip-like").

>   (KEY          ; string: downcased email addr
>     ((VARIANT   ; string: case-preserving address w/ real name
>        (TYPE                                     ; symbol: `mail', etc
>          ('last-sent  LAST_TIME_SENT_TO)          ; int: seconds since epoch
>          ('last-recv  LAST_TIME_RECEIVED_FROM)    ; int: seconds since epoch
>          ('sent-count SENT_COUNT)                 ; int: total times sent
>          ('recv-count RECEIVED_COUNT)             ; int: total times received
>        )
>        ...further TYPEs could go here...
>      )
>      ...further VARIANTs here...
>     )
>     ...[reserved, in case we ever need something other than VARIANTs]...
>   )

Can you show an example where the presence of multiple variants lets you
do something you can't do with the single variant?
I'm not sure I understand the benefits.

AFAICT the main difference here compared to the ecompleterc format is
that we impose the notion of "sending" and "receiving", whereas the
ecompleterc could conceivably be used for things fundamentally unrelated
to sending/receiving messages (e.g. completion of file names, say).

> These two methods of database maintenance are basically compatible.
> In fact, one could use mailaprop's script to generate the database the
> first time, and then depend on ecomplete to keep it up-to-date after
> that.  As long as we document what's going on, and each package uses
> its current defaults, I think we're fine.

I haven't used ecomplete very much so far, but I've noticed some issues
which I think are linked to having multiple Emacs sessions use it at the
same time.  I haven't investigated enough to be sure, but in any case
it's a use case that should be kept in mind.
[ And along vaguely related lines, I'd really like if the ecompleterc
  database could be somehow shared between my different machines.
  E.g. by arranging for git-merge to "do-the-right-thing" on it, or by
  storing (a copy of) it in IMAP.  ]

        Stefan


* Re: Extending the ecomplete.el data store.
  2018-02-04 22:33 ` Stefan Monnier
@ 2018-02-04 23:54   ` Karl Fogel
  2018-02-05  2:34     ` Stefan Monnier
  2018-02-05  9:41   ` Lars Ingebrigtsen
  1 sibling, 1 reply; 20+ messages in thread
From: Karl Fogel @ 2018-02-04 23:54 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>FWIW, I recently installed a completion-table for ecomplete together
>with an ecomplete completion-at-point-function for message.el, which
>together let you use the ecomplete database for TAB completion as well
>as for company-mode ("tooltip-like").

That's good to know, thanks.  All the more reason to centralize on a unified database of email addresses, containing all the information anyone might want, and have packages build functionality around that.

>>   (KEY          ; string: downcased email addr
>>     ((VARIANT   ; string: case-preserving address w/ real name
>>        (TYPE                                     ; symbol: `mail', etc
>>          ('last-sent  LAST_TIME_SENT_TO)          ; int: seconds since epoch
>>          ('last-recv  LAST_TIME_RECEIVED_FROM)    ; int: seconds since epoch
>>          ('sent-count SENT_COUNT)                 ; int: total times sent
>>          ('recv-count RECEIVED_COUNT)             ; int: total times received
>>        )
>>        ...further TYPEs could go here...
>>      )
>>      ...further VARIANTs here...
>>     )
>>     ...[reserved, in case we ever need something other than VARIANTs]...
>>   )
>
>Can you show an example where the presence of multiple variants lets you
>do something you can't do with the single variant?
>I'm not sure I understand the benefits.

Sure.  It's common to have these kinds of variants for one email address (note how subtle case variations can even appear in only the address portion):

  "Wutherington, Joanna - NYC" <joannaw@example.com>
  "Wutherington, Joanna" <joannaw@example.com>
  "JOANNA WUTHERINGTON" <JOANNAW@EXAMPLE.COM>
  "Joanna Wutherington" <joannaw@example.com>
  "Joanna Wutherington" <JoannaW@example.com>
  "joanna wutherington" <joannaw@example.com>
  "J. Wutherington" <joannaw@EXAMPLE.COM>
  "Joanna W." <joannaw@example.com>
  "Joanna W" <joannaw@example.com>
  "joannaw@example.com" <joannaw@example.com>
  ... etc, etc ...

(I've seen all of those variants before, and some addresses show up in my completion database with a significant number of those variants.  Oh, and sometimes they have double quotes and sometimes they don't.  Fun.)

There are many factors that can cause this kind of variation.  For example, when one is sending mail to that recipient, one might compose the email this way...

  "Joanna Wutherington" <joannaw@example.com>

...even though one has never actually received mail from them with that exact form of the address.  Maybe one copied-and-pasted the address from other sources, or whatever.  The point is, that might be the route by which that particular form gets into the completion database.

So the question is, when completing an address, which variant does the user want?

Mailaprop tries to figure out the "best" variant of a given address, and assign that variant a higher score than any of the other variants, so that the "best" one shows up higher in the completion list than any of those others.

The algorithm Mailaprop uses for determining "best" is not important here.  The point is just that in order to have an algorithm at all, the inputs have to be available.  Thus, the reason to have a format that preserves all these variants is so that packages (like ecomplete and mailaprop) can have enough information to try out interesting algorithms for autofill behavior.

As far as I know, ecomplete just always remembers the most-recently-seen variant.  That probably works well for most cases, but there will be times when

  "Joanna Wutherington" <joannaw@example.com>

gets replaced by (say)

  "joannaw@example.com" <joannaw@example.com>

...yet the user would almost certainly prefer the former.  Mailaprop preserves all the variants and scores them in order to avoid that situation.
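
To illustrate why the inputs matter, here is a deliberately naive scorer. It is not mailaprop's algorithm; it only assumes that a variant whose display name differs from the bare address should outrank one that merely repeats the address, with use counts as a tie-breaker. The function names, the bonus of 1000, and the counts in the sample data are all made up for the example.

```elisp
(defun my-variant-has-real-name-p (variant-string)
  "Return non-nil if VARIANT-STRING has a display name that is not just the address."
  (and (string-match "\\`\"?\\([^\"<]*?\\)\"? *<\\([^>]+\\)>\\'" variant-string)
       (let ((name (match-string 1 variant-string))
             (addr (match-string 2 variant-string)))
         (and (not (string= name ""))
              (not (string= (downcase name) (downcase addr)))))))

(defun my-variant-score (variant)
  "Score VARIANT, a (STRING COUNT) list; higher is better."
  (+ (if (my-variant-has-real-name-p (car variant)) 1000 0)
     (nth 1 variant)))

(defun my-best-variant (variants)
  "Return the highest-scoring element of VARIANTS without modifying the list."
  (car (sort (copy-sequence variants)
             (lambda (a b) (> (my-variant-score a) (my-variant-score b))))))

;; The address-only form has been seen more often, but the named form wins.
(my-best-variant
 '(("\"joannaw@example.com\" <joannaw@example.com>" 40)
   ("\"Joanna Wutherington\" <joannaw@example.com>" 3)))
```

None of this is possible if only the most recently seen variant is stored, because by then the better candidate is already gone.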

>AFAICT the main difference here compared to the ecompleterc format is
>that we impose the notion of "sending" and "receiving", whereas the
>ecompleterc could conceivably be used for things fundamentally unrelated
>to sending/receiving messages (e.g. completion of file names, say).

Ah, yes -- so could mailaprop, come to think of it.  However, the TYPE indicator can govern what's in the variant list.  In other words, we can adjust the proposal so that the inner format is just for when TYPE == `mail'.  The inner format for other types has yet to be determined, because we don't know yet what kind of inputs they'll need to make good autofill behavior possible.

>I haven't used ecomplete very much so far, but I've noticed some issues
>which I think are linked to having multiple Emacs sessions use it at the
>same time.  I haven't investigated enough to be sure, but in any case
>it's a use case that should be kept in mind.

That's probably related only to how ecomplete generates its database; I don't think it affects the format of the database.  I.e., if one wants to be able to "splice new things in" to the database in memory, and write the database out at the end of the session, that's not significantly harder with the proposed new format than with the old one, and any multiple-session or cross-machine synchronization/conflict problems are the same in both cases.

>[ And along vaguely related lines, I'd really like if the ecompleterc
>  database could be somehow shared between my different machines.
>  E.g. by arranging for git-merge to "do-the-right-thing" on it, or by
>  storing (a copy of) it in IMAP.  ]

I haven't thought much about that, because I solve that problem out-of-band right now: my mailaprop database is under version control and gets automatically sync'd across all the machines I work on (and the same would be true of .ecompleterc if I were using that).  I agree it would be a good thing if Emacs solved that automagically, as long as it were truly reliable.

Best regards,
-Karl


* Re: Extending the ecomplete.el data store.
  2018-02-04 23:54   ` Karl Fogel
@ 2018-02-05  2:34     ` Stefan Monnier
  2018-02-05  7:17       ` Karl Fogel
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Monnier @ 2018-02-05  2:34 UTC (permalink / raw)
  To: Karl Fogel; +Cc: emacs-devel

> Mailaprop tries to figure out the "best" variant of a given address, and
> assign that variant a higher score than any of the other variants, so that
> the "best" one shows up higher in the completion list than any of
> those others.

OK, I see.  It's kind of a pain having to keep all that info just for
that little tweak, but I guess it can indeed be significant.

And, I guess, in various corner cases, explicit user input to select
the right variant might be needed.

This said, there can be various other info that could determine which
alternative to use.  E.g. you might like to use nicknames when sending
to a group of close friends, but more official names when sending to
some of the same people as part of a work email.  So maybe we should
keep more info than you currently have (i.e. keep a list of other email
addresses that appeared in the same message).

> That's probably related only to how ecomplete generates its database;
> I don't think it affects the format of the database.

No, indeed, it's not related to the format.  More a question of
process synchronization.

>>[ And along vaguely related lines, I'd really like if the ecompleterc
>>  database could be somehow shared between my different machines.
>>  E.g. by arranging for git-merge to "do-the-right-thing" on it, or by
>>  storing (a copy of) it in IMAP.  ]
>
> I haven't thought much about that, because I solve that problem out-of-band
> right now: my mailaprop database is under version control and gets
> automatically sync'd across all the machines I work on (and the same would
> be true of .ecompleterc if I were using that).  I agree it would be a good
> thing if Emacs solved that automagically, as long as it were truly reliable.

I tried doing the same with ecompleterc but that results in too many
conflicts that are annoying to resolve by hand.
Based on your description of the format you use, I'm surprised you're
not suffering from conflicts as well.

        Stefan


* Re: Extending the ecomplete.el data store.
  2018-02-05  2:34     ` Stefan Monnier
@ 2018-02-05  7:17       ` Karl Fogel
  2018-02-05 18:30         ` Stefan Monnier
  0 siblings, 1 reply; 20+ messages in thread
From: Karl Fogel @ 2018-02-05  7:17 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
>This said, there can be various other info that could determine which
>alternative to use.  E.g. you might like to use nicknames when sending
>to a group of close friends, but more official names when sending to
>some of the same people as part of a work email.  So maybe we should
>keep more info than you currently have (i.e. keep a list of other email
>addresses that appeared in the same message).

Yup.  That's one reason why I think it's worth moving to this more extensible format.  Such things could be added later, and the format could handle them.  (Maybe we should add a top level symbol in each record stating what flavor of record it is -- e.g., for the current proposal, that symbol would be 'email-addr' or something.)

>> I haven't thought much about that, because I solve that problem out-of-band
>> right now: my mailaprop database is under version control and gets
>> automatically sync'd across all the machines I work on (and the same would
>> be true of .ecompleterc if I were using that).  I agree it would be a good
>> thing if Emacs solved that automagically, as long as it were truly reliable.
>
>I tried doing the same with ecompleterc but that results in too many
>conflicts that are annoying to resolve by hand.
>Based on your description of the format you use, I'm surprised you're
>not suffering from conflicts as well.

Remember, mailaprop isn't updating the datastore during the Emacs session -- when Emacs exits, mailaprop isn't writing anything out to disk.  Instead, the database is regenerated whole each time, from the entire email corpus.  When I do that regeneration, the last step is to push the new database up to the version control repository.

If mailaprop were like ecomplete, updating the database from activity during the Emacs session (which I think would be a good feature to add some day), then it would indeed have the same conflict problem you experienced.

Best regards,
-Karl


* Re: Extending the ecomplete.el data store.
  2018-02-04  6:16 Extending the ecomplete.el data store Karl Fogel
  2018-02-04 22:33 ` Stefan Monnier
@ 2018-02-05  9:40 ` Lars Ingebrigtsen
  2018-02-06 20:17   ` Karl Fogel
  2018-02-06 21:12   ` Stefan Monnier
  1 sibling, 2 replies; 20+ messages in thread
From: Lars Ingebrigtsen @ 2018-02-05  9:40 UTC (permalink / raw)
  To: Karl Fogel; +Cc: Emacs Devel

Karl Fogel <kfogel@red-bean.com> writes:

> * Mailaprop remembers all the real-name variations and case variations
> individually, including case variations in the email address portion
> as well as in the real name portion.  So each variation gets its own
> record, but they're all tied together under the same case-folded KEY
> so they can be scored together.  (Contrast with ecomplete, where I
> believe `ecomplete-add-item' just remembers the most recently-seen
> variant for a given key.)

Yes, I see the advantages of storing all the variations (it gives us a
larger search space).

However, I've found that in practice the simple "store the last
variation" thing works surprisingly well.  But the disadvantage is that
you basically lose the completion if the last variation is degenerate,
like if you'd written "From: HAHAHA <kfogel@red-bean.com>", then my
Message/icomplete wouldn't be able to complete on "Karl" (which is what
you'd get normally).

On the other hand, if you store all variations, then HAHAHA will forever
be an available completion, too, which also has disadvantages.

So: Either complete historical completion, or incomplete, but pretty
up-to-date completion.

If you have too much to complete on, you just end up with noise.

> If we unify the format, we should probably unify on one default record
> file too.  Right now, `ecomplete-database-file' defaults to
> ~/.ecompleterc or ~/.emacs.d/ecompleterc, whereas
> `mailaprop-address-file' doesn't default to anything -- the user must
> set it manually: email addresses are pretty private, and I didn't want
> to guess about what locations would be confidential enough.  I'd be
> happy to just have mailaprop use ecomplete's defaults for the database
> file, though.  The privacy concern can be addressed with
> documentation.

If the user has said that they want completion, the user will surmise
that the data has to be stored somehow.

> I guess we would also switch to UTF-8 for the coding system for the
> database?  (Right now `ecomplete-database-file-coding-system' defaults
> to `iso-2022-7bit'.)

The latter can store more than the former, but UTF-8 is fine by me.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* Re: Extending the ecomplete.el data store.
  2018-02-04 22:33 ` Stefan Monnier
  2018-02-04 23:54   ` Karl Fogel
@ 2018-02-05  9:41   ` Lars Ingebrigtsen
  2018-02-06 21:01     ` Modifying a shared file (was: Extending the ecomplete.el data store) Stefan Monnier
  1 sibling, 1 reply; 20+ messages in thread
From: Lars Ingebrigtsen @ 2018-02-05  9:41 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> I haven't used ecomplete very much so far, but I've noticed some issues
> which I think are linked to having multiple Emacs sessions use it at the
> same time.  I haven't investigated enough to be sure, but in any case
> it's a use case that should be kept in mind.

Yes, ecomplete should check the modification date and reload the data if
it has changed.
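
A minimal sketch of that check, assuming the database is a single readable Lisp form in one file. The variables and function names (`my-db-file', `my-db', `my-db-mtime', `my-db-reload-if-changed') are hypothetical and are not ecomplete's; coding-system handling is left out for brevity.

```elisp
(defvar my-db-file nil
  "Hypothetical path of the database file.")
(defvar my-db nil
  "In-memory copy of the database.")
(defvar my-db-mtime nil
  "Modification time of `my-db-file' when it was last read.")

(defun my-db-file-mtime (file)
  "Return the modification time of FILE as reported by `file-attributes'."
  (nth 5 (file-attributes file)))

(defun my-db-reload-if-changed ()
  "Re-read `my-db-file' if it changed since the last read.
Return non-nil when a reload happened."
  (let ((mtime (and my-db-file
                    (file-exists-p my-db-file)
                    (my-db-file-mtime my-db-file))))
    (when (and mtime (not (equal mtime my-db-mtime)))
      (with-temp-buffer
        (insert-file-contents my-db-file)
        (setq my-db (read (current-buffer))))
      (setq my-db-mtime mtime)
      t)))
```

Calling this before each lookup and before each save narrows the window in which one session overwrites another's changes, but it does not close it: two sessions can still read, modify, and write in an interleaved order, so some form of locking or merging would still be needed.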

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




* Re: Extending the ecomplete.el data store.
  2018-02-05  7:17       ` Karl Fogel
@ 2018-02-05 18:30         ` Stefan Monnier
  2018-02-06 20:19           ` Karl Fogel
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Monnier @ 2018-02-05 18:30 UTC (permalink / raw)
  To: Karl Fogel; +Cc: emacs-devel

>>This said, there can be various other info that could determine which
>>alternative to use.  E.g. you might like to use nicknames when sending
>>to a group of close friends, but more official names when sending to
>>some of the same people as part of a work email.  So maybe we should
>>keep more info than you currently have (i.e. keep a list of other email
>>addresses that appeared in the same message).
> Yup.  That's one reason why I think it's worth moving to this more
> extensible format.  Such things could be added later, and the format could
> handle them.  (Maybe we should add a top level symbol in each record stating
> what flavor of record it is -- e.g., for the current proposal, that symbol
> would be 'email-addr' or something.)

Hmm... the more I think about it, the more it seems that the data and
the selection algorithm are fairly tightly bound, so having a generic
format might not give very much benefit, because you'll need specialized
code to make good sense of it anyway.

So I'm wondering if we're not better off with "use your own format" and
then use completion-at-point-functions (or something like it) as the
compatibility layer.
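
As a sketch of what that layer looks like from a backend's side: each package keeps its own storage and scoring, and only has to expose a function that returns the bounds of the text being completed plus a collection of candidates. The names `my-address-candidates' and `my-address-capf' are hypothetical, and the static list stands in for whatever a real backend would compute from its own database.

```elisp
(defvar my-address-candidates
  '("Lars Ingebrigtsen <larsi@example.com>"
    "Karl Fogel <kfogel@example.com>")
  "Stand-in for candidates produced by a backend's own format and scoring.")

(defun my-address-capf ()
  "Offer address completion for the text after the last comma, colon, or newline.
Return (START END COLLECTION . PROPS), or nil when there is nothing to complete."
  (let* ((end (point))
         (start (save-excursion
                  (skip-chars-backward "^,:\n")
                  ;; Skip leading whitespace, but never move past point.
                  (skip-chars-forward " \t" end)
                  (point))))
    (when (< start end)
      ;; `:exclusive no' lets other functions on the hook try if this one fails.
      (list start end my-address-candidates :exclusive 'no))))

;; Enable in the current buffer only:
;; (add-hook 'completion-at-point-functions #'my-address-capf nil t)
```

A front end such as TAB completion or company-mode then works with any backend that provides such a function, without knowing how the data is stored.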


        Stefan




* Re: Extending the ecomplete.el data store.
  2018-02-05  9:40 ` Extending the ecomplete.el data store Lars Ingebrigtsen
@ 2018-02-06 20:17   ` Karl Fogel
  2018-04-10 20:47     ` Lars Ingebrigtsen
  2018-02-06 21:12   ` Stefan Monnier
  1 sibling, 1 reply; 20+ messages in thread
From: Karl Fogel @ 2018-02-06 20:17 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Emacs Devel

Lars Ingebrigtsen <larsi@gnus.org> writes:
>> * Mailaprop remembers all the real-name variations and case variations
>> individually, including case variations in the email address portion
>> as well as in the real name portion.  So each variation gets its own
>> record, but they're all tied together under the same case-folded KEY
>> so they can be scored together.  (Contrast with ecomplete, where I
>> believe `ecomplete-add-item' just remembers the most recently-seen
>> variant for a given key.)
>
>Yes, I see the advantages of storing all the variations (it gives us a
>larger search space).
>
>However, I've found that in practice the simple "store the last
>variation" thing works surprisingly well.  But the disadvantage is that
>you basically lose the completion if the last variation is degenerate,
>like if you'd written "From: HAHAHA <kfogel@red-bean.com>", then my
>Message/icomplete wouldn't be able to complete on "Karl" (which is what
>you'd get normally).
>
>On the other hand, if you store all variations, then HAHAHA will forever
>be an available completion, too, which also has disadvantages.

That's where creative scoring comes in.  For example, mailaprop handles that case by inspecting the variants and simply assigning higher scores to the better ones.  It has an idea of what "better" means: "Lars Ingebrigtsen <larsi@example.com>" is better than "L. Ingebrigtsen <larsi@example.com>", according to mailaprop.

>So: Either complete historical completion, or incomplete, but pretty
>up-to-date completion.

I don't think that's the choice we face.  Rather, the choice is: have enough information to make interesting decisions, or not have enough information :-).

I think you're conflating the storage format with the in-session UI behavior.  Ecomplete can continue to throw away all but the most recent variant, if it wishes.  Other programs can use all of the data and run it through super fancy machine-learning convoluted neural network AI bots working in tandem with a crowdsourced social media strategy that leverages the power of decentralized blockchain advertising affiliate networks to determine what completions they're going to offer.

But for programs to have this choice, the storage format must hold all the data that seems obviously relevant (and be extensible, in case somebody thinks of something later).  Then it's up to the programs to decide what subset of that data they want to use.  They don't have to use all of it.

>If you have too much to complete on, you just end up with noise.

Not really, because scoring allows one to put the right completions near the top.  I rely on this every day now: for the vast majority of recipient addresses, I only have to type one or two letters and hit Return, because the choice I wanted is also the one that's scored highest.  Very occasionally I have to type a longer substring -- and in those cases, being able to type just, say, "lars ing RET" and have the Right Thing happen is a lovely user experience.
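
The general idea can be sketched in a few lines: keep every candidate that contains all the typed words, then order by score. This is an illustration only, not mailaprop's matcher; the function name is hypothetical, the first two scores reuse the TIMES_USED values from the ecomplete example at the top of the thread, and the third candidate is invented.

```elisp
(defun my-match-and-rank (input candidates)
  "Return strings from CANDIDATES matching INPUT, best score first.
CANDIDATES is a list of (STRING . SCORE).  A candidate matches when it
contains every space-separated word of INPUT, ignoring case."
  (let ((words (split-string (downcase input) " " t))
        (matches nil))
    (dolist (cand candidates)
      (let ((text (downcase (car cand)))
            (ok t))
        (dolist (word words)
          (unless (string-match-p (regexp-quote word) text)
            (setq ok nil)))
        (when ok (push cand matches))))
    (mapcar #'car
            (sort matches (lambda (a b) (> (cdr a) (cdr b)))))))

(my-match-and-rank
 "lars ing"
 '(("Lars Ingebrigtsen <larsi@example.com>" . 381)
   ("Karl Fogel <kfogel@example.com>" . 10)
   ("Larsson, Inga <inga@example.com>" . 5)))
```

Both "Lars Ingebrigtsen" and the invented "Larsson, Inga" match "lars ing", but the higher score puts the former first, so a long list of stored variants does not have to translate into a noisy completion list.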

>> I guess we would also switch to UTF-8 for the coding system for the
>> database?  (Right now `ecomplete-database-file-coding-system' defaults
>> to `iso-2022-7bit'.)
>
>The latter can store more than the former, but UTF-8 is fine by me.

Thanks.  I didn't know that; until now, I didn't realize what ISO-2022 actually is [1].  I tend to lean UTF-8 because it's a widely-supported standard, e.g., if someone brings up their database file in a buffer or pages through it with a command-line pager, it'll usually be readable in both cases.  

Best regards,
-Karl

[1] Just looked at https://en.wikipedia.org/wiki/ISO/IEC_2022#Comparison_with_other_encodings now.


* Re: Extending the ecomplete.el data store.
  2018-02-05 18:30         ` Stefan Monnier
@ 2018-02-06 20:19           ` Karl Fogel
  2018-02-06 20:39             ` Stefan Monnier
  0 siblings, 1 reply; 20+ messages in thread
From: Karl Fogel @ 2018-02-06 20:19 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
>>>This said, there can be various other info that could determine which
>>>alternative to use.  E.g. you might like to use nicknames when sending
>>>to a group of close friends, but more official names when sending to
>>>some of the same people as part of a work email.  So maybe we should
>>>keep more info than you currently have (i.e. keep a list of other email
>>>addresses that appeared in the same message).
>> Yup.  That's one reason why I think it's worth moving to this more
>> extensible format.  Such things could be added later, and the format could
>> handle them.  (Maybe we should add a top level symbol in each record stating
>> what flavor of record it is -- e.g., for the current proposal, that symbol
>> would be 'email-addr' or something.)
>
>Hmm... the more I think about it, the more it seems that the data and
>the selection algorithm are fairly tightly bound, so having a generic
>format might not give very much benefit, because you'll need specialized
>code to make good sense of it anyway.

Well, this all started because Lars and I observed that ecomplete and mailaprop currently use very different selection algorithms yet the raw data they work with is very similar -- not exactly the same, but highly overlapping.



* Re: Extending the ecomplete.el data store.
  2018-02-06 20:19           ` Karl Fogel
@ 2018-02-06 20:39             ` Stefan Monnier
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Monnier @ 2018-02-06 20:39 UTC (permalink / raw)
  To: Karl Fogel; +Cc: emacs-devel

> Well, this all started because Lars and I observed that ecomplete and
> mailaprop currently use very different selection algorithms yet the raw data
> they work with is very similar -- not exactly the same, but
> highly overlapping.

So I guess my conclusion is that if you want to share something, there's
maybe no point trying to spend much time designing something very general:
most of the effort is in the selection algorithm anyway, not in the
reading&writing of the raw data.


        Stefan



* Modifying a shared file (was: Extending the ecomplete.el data store)
  2018-02-05  9:41   ` Lars Ingebrigtsen
@ 2018-02-06 21:01     ` Stefan Monnier
  2018-02-06 22:33       ` Modifying a shared file Clément Pit-Claudel
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Monnier @ 2018-02-06 21:01 UTC (permalink / raw)
  To: emacs-devel

>> I haven't used ecomplete very much so far, but I've noticed some issues
>> which I think are linked to having multiple Emacs sessions use it at the
>> same time.  I haven't investigated enough to be sure, but in any case
>> it's a use case that should be kept in mind.
> Yes, ecomplete should check the modification date and reload the data if
> it has changed.

[ Plus some kind of locking to avoid modifying the file at the
  same time.  ]

I'd welcome some new function

    (defun update-file (file modification)
      "Apply MODIFICATION to FILE."
      ...)

tho we'd want to avoid re-reading the file if it hasn't been modified
since last time.  So maybe we need to first define

    (cl-defstruct file-contents
      (file nil :type string)
      (reader nil :type function)
      (writer nil :type function)
      timestamp
      val)

and then

    (defun file-contents-read (file &optional reader writer)
      "Return a `file-contents` object representing the contents of FILE."
      (with-temp-buffer
        (insert-file-contents file)
        (make-file-contents :file file :reader reader :writer writer
                            :timestamp ...
                            ;; READER is called with no arguments, in the
                            ;; buffer holding FILE's contents.
                            :val (funcall (or reader
                                              (lambda () (read (current-buffer))))))))

    (defun file-contents-update (fc modification)
      "Apply MODIFICATION both to FC and to its file."
      ...)

where `modification` is a function that takes the old value and returns
a new value (and it might be called several times).
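
A rough, self-contained sketch of the "re-read only if the file changed" part of this idea. It is not an existing Emacs API: every `sketch-` name is hypothetical, the cache is a plain cons rather than the struct above, and there is no locking, so two sessions can still race between the check and the write.

```elisp
;;; -*- lexical-binding: t; -*-
;; Hypothetical sketch; no locking, single Lisp form per file.

(defun sketch-file-mtime (file)
  "Return FILE's modification time, or nil if FILE does not exist."
  (nth 5 (file-attributes file)))

(defun sketch-read-file (file)
  "Read one Lisp form from FILE; return nil if FILE is missing or empty."
  (when (file-exists-p file)
    (with-temp-buffer
      (insert-file-contents file)
      (condition-case nil
          (read (current-buffer))
        (end-of-file nil)))))

(defun sketch-update-file (file modification &optional cache)
  "Apply MODIFICATION to the Lisp value stored in FILE and write it back.
CACHE, if non-nil, is a cons (MTIME . VALUE) returned by an earlier
call; FILE is re-read only when its modification time differs from
MTIME.  Return a fresh (MTIME . VALUE) cons for the new contents."
  (let* ((mtime (sketch-file-mtime file))
         (old (if (and cache mtime (equal mtime (car cache)))
                  (cdr cache)
                (sketch-read-file file)))
         (new (funcall modification old)))
    (with-temp-file file
      (prin1 new (current-buffer)))
    (cons (sketch-file-mtime file) new)))
```

A caller would keep the returned cons and pass it back in on the next update. A version with locking would need to retry when it loses the race, which is why the modification function may end up being called several times.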


-- Stefan




* Re: Extending the ecomplete.el data store.
  2018-02-05  9:40 ` Extending the ecomplete.el data store Lars Ingebrigtsen
  2018-02-06 20:17   ` Karl Fogel
@ 2018-02-06 21:12   ` Stefan Monnier
  2018-02-06 23:04     ` Karl Fogel
  2018-04-10 21:00     ` Lars Ingebrigtsen
  1 sibling, 2 replies; 20+ messages in thread
From: Stefan Monnier @ 2018-02-06 21:12 UTC (permalink / raw)
  To: emacs-devel

> Yes, I see the advantages of storing all the variations (it gives us a
> larger search space).

The downside is that it slows down operation at times.

> However, I've found that in practice the simple "store the last
> variation" thing works surprisingly well.  But the disadvantage is that
> you basically lose the completion if the last variation is degenerate,
> like if you'd written "From: HAHAHA <kfogel@red-bean.com>", then my
> Message/ecomplete wouldn't be able to complete on "Karl" (which is what
> you'd get normally).

Another approach is to update more carefully: check that all words from
the old variation still appear in the new variation, and if not, check
if new words appeared (as above): if not it means replacing the old with
the new would reduce the amount of info so it's probably not a good
idea, and if yes then prompt the user.
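
The rule above can be sketched as a three-way decision. This is only an illustration under assumptions: the `sketch-` names are hypothetical, "words" are taken to be the alphanumeric runs of the display name, and the part in angle brackets is ignored.

```elisp
;; Hypothetical sketch of the "update more carefully" rule.
(require 'cl-lib)

(defun sketch-words (variant)
  "Return the downcased display-name words of VARIANT, ignoring <address>."
  (split-string (downcase (replace-regexp-in-string "<[^>]*>" "" variant))
                "[^[:alnum:]]+" t))

(defun sketch-update-decision (old new)
  "Decide what to do when NEW would replace the stored variant OLD.
Return `replace' if NEW keeps every word of OLD, `keep' if NEW only
loses words, and `ask' if NEW both drops old words and adds new ones."
  (let ((old-words (sketch-words old))
        (new-words (sketch-words new)))
    (cond ((cl-subsetp old-words new-words :test #'equal) 'replace)
          ((cl-subsetp new-words old-words :test #'equal) 'keep)
          (t 'ask))))
```

For example, going from "Karl <kfogel@red-bean.com>" to "Karl Fogel <kfogel@red-bean.com>" gives `replace', the reverse gives `keep', and going from "Karl Fogel <kfogel@red-bean.com>" to "HAHAHA <kfogel@red-bean.com>" gives `ask'.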

>> I guess we would also switch to UTF-8 for the coding system for the
>> database?  (Right now `ecomplete-database-file-coding-system' defaults
>> to `iso-2022-7bit'.)
> The latter can store more than the former, but UTF-8 is fine by me.

I think it'd make sense to use `emacs-internal` coding system (aka
utf-8-emacs-unix).


        Stefan




* Re: Modifying a shared file
  2018-02-06 21:01     ` Modifying a shared file (was: Extending the ecomplete.el data store) Stefan Monnier
@ 2018-02-06 22:33       ` Clément Pit-Claudel
  0 siblings, 0 replies; 20+ messages in thread
From: Clément Pit-Claudel @ 2018-02-06 22:33 UTC (permalink / raw)
  To: emacs-devel

On 2018-02-06 16:01, Stefan Monnier wrote:
>>> I haven't used ecomplete very much so far, but I've noticed some issues
>>> which I think are linked to having multiple Emacs sessions use it at the
>>> same time.  I haven't investigated enough to be sure, but in any case
>>> it's a use case that should be kept in mind.
>> Yes, ecomplete should check the modification date and reload the data if
>> it has changed.
> 
> [ Plus some kind of locking to avoid modifying the file at the
>   same time.  ]
> 
> I'd welcome some new function
> 
>     (defun update-file (file modification)
>       "Apply MODIFICATION to FILE."
>       ...)

This would be great. It would work nicely for recentf and savehist-mode, I think.




* Re: Extending the ecomplete.el data store.
  2018-02-06 21:12   ` Stefan Monnier
@ 2018-02-06 23:04     ` Karl Fogel
  2018-02-06 23:21       ` Stefan Monnier
  2018-04-10 21:00     ` Lars Ingebrigtsen
  1 sibling, 1 reply; 20+ messages in thread
From: Karl Fogel @ 2018-02-06 23:04 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> Yes, I see the advantages of storing all the variations (it gives us a
>> larger search space).
>
>The downside is that it slows down operation at times.

It could affect startup and shutdown time.  It should have no effect on operations during the session (because the on-disk format isn't the same as what is used for interactive completion, and needn't even be the same as what is used for on-the-fly retention of new addresses).

>Another approach is to update more carefully: check that all words from
>the old variation still appear in the new variation, and if not, check
>if new words appeared (as above): if not it means replacing the old with
>the new would reduce the amount of info so it's probably not a good
>idea, and if yes then prompt the user.

Yes; a ratchet that only moves in the "better" direction.  That's a good solution too.

>>> I guess we would also switch to UTF-8 for the coding system for the
>>> database?  (Right now `ecomplete-database-file-coding-system' defaults
>>> to `iso-2022-7bit'.)
>> The latter can store more than the former, but UTF-8 is fine by me.
>
>I think it'd make sense to use `emacs-internal` coding system (aka
>utf-8-emacs-unix).

*nod*

It sounds like the format unification would only happen if Lars or I feel strongly enough to make it happen, and I'm not sure either of us does right now.  This thread has at least recorded some of the thinking.  At some point, writing a lossless converter from the mailaprop format to ecomplete's format might be useful.

In the meantime, if I'm planning to actually take any action toward format unification, I'll make a noise here.

Best regards,
-Karl

* Re: Extending the ecomplete.el data store.
  2018-02-06 23:04     ` Karl Fogel
@ 2018-02-06 23:21       ` Stefan Monnier
  2018-02-08 17:21         ` Karl Fogel
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Monnier @ 2018-02-06 23:21 UTC (permalink / raw)
  To: Karl Fogel; +Cc: emacs-devel

> In the meantime, if I'm planning to actually take any action toward format
> unification, I'll make a noise here.

A good halfway step would be for mailaprop to provide
a completion-at-point-function so it can also be used for
TAB-completion.

I think the main hurdle is that completion-at-point-function only
directly supports prefix completion.  The UI on top of it supports
substring completion, but it does it by requesting "all completions"
from the backend (i.e. from mailaprop in our case) and then doing the
substring search.  So if "HAHAHA" appears in one of the variants but not
in the "canonical" name, it won't be found.  And more importantly,
listing all completions needs to be fast/memoized otherwise the user
will not like the performance.
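
A minimal sketch of what such a function could look like, assuming a flat list of candidate strings. The `sketch-` names and the sample addresses are hypothetical, and a real backend would supply its own (ideally cached) candidate list.

```elisp
;; Hypothetical sketch of a completion-at-point function for addresses.

(defvar sketch-capf-addresses
  '("Karl Fogel <kfogel@example.com>"
    "Lars Ingebrigtsen <larsi@example.com>")
  "Made-up candidate list; a real backend would supply its own.")

(defun sketch-address-capf ()
  "Offer `sketch-capf-addresses' for the address fragment before point.
The fragment starts after the nearest preceding colon, comma, or
newline, skipping leading whitespace.  Return nil if it is empty."
  (let* ((end (point))
         (start (save-excursion
                  (skip-chars-backward "^:,\n")
                  (skip-chars-forward " \t" end)
                  (point))))
    (when (< start end)
      (list start end sketch-capf-addresses :exclusive 'no))))

;; To try it in a message buffer:
;; (add-hook 'completion-at-point-functions #'sketch-address-capf nil t)
;; (setq-local completion-styles '(substring))
```

The completion UI only ever sees the strings in the returned collection, so a word that appears in a stored variant but not in any listed string cannot be matched, whichever completion style is in use.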

        Stefan

* Re: Extending the ecomplete.el data store.
  2018-02-06 23:21       ` Stefan Monnier
@ 2018-02-08 17:21         ` Karl Fogel
  0 siblings, 0 replies; 20+ messages in thread
From: Karl Fogel @ 2018-02-08 17:21 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
>A good halfway step would be for mailaprop to provide
>a completion-at-point-function so it can also be used for
>TAB-completion.

Thanks; I'll take a look.

>I think the main hurdle is that completion-at-point-function only
>directly supports prefix completion.  The UI on top of it supports
>substring completion, but it does it by requesting "all completions"
>from the backend (i.e. from mailaprop in our case) and then doing the
>substring search.  So if "HAHAHA" appears in one of the variants but not
>in the "canonical" name, it won't be found.  And more importantly,
>listing all completions needs to be fast/memoized otherwise the user
>will not like the performance.

*nod*  I don't know much about how core Emacs handles this.  I assume it does not memoize all the possible prefix strings and does not ask callers to do that either.  Mailaprop (which operates with substrings rather than just prefixes) memoizes every substring that gets completed at least once.  If there's any existing code in Emacs that you can point me to as a good example of the kind of memoization you're thinking of, please do.
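
To make the idea concrete, here is a small sketch of memoizing substring lookups in a hash table. It is an illustration only, not mailaprop's code: the `sketch-` names are hypothetical, and the cache is keyed by the input string alone.

```elisp
;; Hypothetical sketch of per-input memoization.

(defvar sketch-memo (make-hash-table :test #'equal)
  "Cache mapping an input string to its list of matching candidates.")

(defun sketch-substring-matches (input candidates)
  "Return the members of CANDIDATES containing INPUT, ignoring case."
  (let ((case-fold-search t)
        (re (regexp-quote input))
        (hits '()))
    (dolist (c candidates (nreverse hits))
      (when (string-match-p re c)
        (push c hits)))))

(defun sketch-memoized-matches (input candidates)
  "Like `sketch-substring-matches', but compute each INPUT only once.
The cache must be cleared with `clrhash' whenever CANDIDATES changes."
  (let ((cached (gethash input sketch-memo 'miss)))
    (if (not (eq cached 'miss))
        cached
      (puthash input (sketch-substring-matches input candidates)
               sketch-memo))))
```

Only inputs that were actually looked up occupy the cache, so its size follows usage rather than the number of possible substrings.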

Best regards,
-Karl

* Re: Extending the ecomplete.el data store.
  2018-02-06 20:17   ` Karl Fogel
@ 2018-04-10 20:47     ` Lars Ingebrigtsen
  0 siblings, 0 replies; 20+ messages in thread
From: Lars Ingebrigtsen @ 2018-04-10 20:47 UTC (permalink / raw)
  To: Karl Fogel; +Cc: Emacs Devel

Karl Fogel <kfogel@red-bean.com> writes:

> I think you're conflating the storage format with the in-session UI
> behavior.  Ecomplete can continue to throw away all but the most
> recent variant, if it wishes.

Yeah, I think you're right.  Storing all the variations but making the
selection algorithm Do What I Mean is probably the best choice here, and
leads to the fewest corner cases and surprises.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



* Re: Extending the ecomplete.el data store.
  2018-02-06 21:12   ` Stefan Monnier
  2018-02-06 23:04     ` Karl Fogel
@ 2018-04-10 21:00     ` Lars Ingebrigtsen
  2018-04-10 21:08       ` Stefan Monnier
  1 sibling, 1 reply; 20+ messages in thread
From: Lars Ingebrigtsen @ 2018-04-10 21:00 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Another approach is to update more carefully: check that all words from
> the old variation still appear in the new variation, and if not, check
> if new words appeared (as above): if not it means replacing the old with
> the new would reduce the amount of info so it's probably not a good
> idea, and if yes then prompt the user.

We'd still need to expand the storage so that we know what we're
expanding to.

So if we have the following scenario:

Karl Fogel <kfogel@red-bean.com>
HAHAHA <kfogel@red-bean.com>

we do not want to expand to "Karl Fogel HAHAHA <kfogel@red-bean.com>" if
the user types "Karl".  Perhaps we even want "HAHAHA" to expand to
"Karl Fogel <kfogel@red-bean.com>"?  In any case, I think the way to
achieve this is to do what Karl suggests, and store all the variations
and make the user interface be clever.
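
One possible shape for "store all the variations, be clever in the UI", sketched under assumptions: records use the mailaprop-style (KEY ((VARIANT LAST-USED SENT RECEIVED) ...)) layout, the preferred variant is simply the one with the highest sent count, and the `sketch-` names, dates, and counts are made up.

```elisp
;;; -*- lexical-binding: t; -*-
;; Hypothetical sketch: match on any variant, expand to the preferred one.
(require 'cl-lib)

(defvar sketch-records
  '(("kfogel@red-bean.com"
     (("Karl Fogel <kfogel@red-bean.com>" "2018 Jan 15" 8 7)
      ("HAHAHA <kfogel@red-bean.com>"     "2018 Feb 05" 1 0))))
  "Made-up sample data: (KEY ((VARIANT LAST-USED SENT RECEIVED) ...)).")

(defun sketch-preferred-variant (variants)
  "Return the string of the entry in VARIANTS with the highest sent count."
  (let ((best (car variants)))
    (dolist (v (cdr variants) (car best))
      (when (> (nth 2 v) (nth 2 best))
        (setq best v)))))

(defun sketch-expand (input)
  "Return the preferred variant of each record with a variant matching INPUT."
  (let ((case-fold-search t)
        (re (regexp-quote input))
        (result '()))
    (dolist (record sketch-records (nreverse result))
      (let ((variants (cadr record)))
        (when (cl-some (lambda (v) (string-match-p re (car v))) variants)
          (push (sketch-preferred-variant variants) result))))))
```

Here both "karl" and "hahaha" expand to "Karl Fogel <kfogel@red-bean.com>", and the merged string "Karl Fogel HAHAHA ..." is never produced because variants are kept as separate entries.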

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

* Re: Extending the ecomplete.el data store.
  2018-04-10 21:00     ` Lars Ingebrigtsen
@ 2018-04-10 21:08       ` Stefan Monnier
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Monnier @ 2018-04-10 21:08 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: emacs-devel

>> Another approach is to update more carefully: check that all words from
>> the old variation still appear in the new variation, and if not, check
>> if new words appeared (as above): if not it means replacing the old with
>> the new would reduce the amount of info so it's probably not a good
>> idea, and if yes then prompt the user.
> So if we have the following scenario:
>
> Karl Fogel <kfogel@red-bean.com>
> HAHAHA <kfogel@red-bean.com>

In this scenario, neither of the two is a superset of the other, so we'd
prompt the user.


        Stefan



Thread overview: 20+ messages
2018-02-04  6:16 Extending the ecomplete.el data store Karl Fogel
2018-02-04 22:33 ` Stefan Monnier
2018-02-04 23:54   ` Karl Fogel
2018-02-05  2:34     ` Stefan Monnier
2018-02-05  7:17       ` Karl Fogel
2018-02-05 18:30         ` Stefan Monnier
2018-02-06 20:19           ` Karl Fogel
2018-02-06 20:39             ` Stefan Monnier
2018-02-05  9:41   ` Lars Ingebrigtsen
2018-02-06 21:01     ` Modifying a shared file (was: Extending the ecomplete.el data store) Stefan Monnier
2018-02-06 22:33       ` Modifying a shared file Clément Pit-Claudel
2018-02-05  9:40 ` Extending the ecomplete.el data store Lars Ingebrigtsen
2018-02-06 20:17   ` Karl Fogel
2018-04-10 20:47     ` Lars Ingebrigtsen
2018-02-06 21:12   ` Stefan Monnier
2018-02-06 23:04     ` Karl Fogel
2018-02-06 23:21       ` Stefan Monnier
2018-02-08 17:21         ` Karl Fogel
2018-04-10 21:00     ` Lars Ingebrigtsen
2018-04-10 21:08       ` Stefan Monnier
