From: Karl Fogel
Newsgroups: gmane.emacs.devel
Subject: Re: Extending the ecomplete.el data store.
Date: Tue, 06 Feb 2018 14:17:33 -0600
Message-ID: <87zi4lyisi.fsf@red-bean.com>
References: <87fu6hcm9r.fsf@red-bean.com>
Reply-To: Karl Fogel
Cc: Emacs Devel
To: Lars Ingebrigtsen
In-Reply-To: (Lars Ingebrigtsen's message of "Mon, 05 Feb 2018 10:40:30 +0100")
List-Id: "Emacs development discussions."

Lars Ingebrigtsen writes:

>> * Mailaprop remembers all the real-name variations and case variations
>>   individually, including case variations in the email address portion
>>   as well as in the real name portion. So each variation gets its own
>>   record, but they're all tied together under the same case-folded KEY
>>   so they can be scored together. (Contrast with ecomplete, where I
>>   believe `ecomplete-add-item' just remembers the most recently-seen
>>   variant for a given key.)
>
>Yes, I see the advantages of storing all the variations (it gives us a
>larger search space).
>
>However, I've found that in practice the simple "store the last
>variation" thing works surprisingly well. But the disadvantage is that
>you basically lose the completion if the last variation is degenerate,
>like if you'd written "From: HAHAHA ", then my
>Message/icomplete wouldn't be able to complete on "Karl" (which is what
>you'd get normally).
>
>On the other hand, if you store all variations, then HAHAHA will forever
>be an available completion, too, which also has disadvantages.

That's where creative scoring comes in. For example, mailaprop handles
that case by inspecting the variants and simply assigning higher scores
to the better ones. It has an idea of what "better" means:
"Lars Ingebrigtsen " is better than "L. Ingebrigtsen ", according to
mailaprop.

>So: Either complete historical completion, or uncomplete, but pretty
>up-to-date completion.

I don't think that's the choice we face. Rather, the choice is: have
enough information to make interesting decisions, or not have enough
information :-).
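
The "one case-folded key, many variants, score them together" idea can
be sketched roughly as follows. This is Python purely for illustration,
not mailaprop's actual code or file format: the names `store`,
`add_variant`, `variant_score` and `best_variant`, the scoring weights,
and the example.org addresses are all made up here.

```python
from collections import defaultdict
from email.utils import parseaddr

# Hypothetical store: case-folded address -> {exact variant: times seen}.
store = defaultdict(lambda: defaultdict(int))


def add_variant(header_value):
    """Record one exact "Real Name <addr>" variant under its case-folded key."""
    _name, addr = parseaddr(header_value)
    if addr:
        store[addr.casefold()][header_value] += 1


def variant_score(variant, count):
    """Made-up heuristic: prefer variants seen often and with a full-looking name."""
    name, _addr = parseaddr(variant)
    # Count name words that are more than a bare initial such as "K.".
    full_words = [w for w in name.split() if len(w.rstrip(".")) > 1]
    score = count + 2 * len(full_words)
    # Penalize all-caps names such as "HAHAHA".
    if name and name.isupper():
        score -= 5
    return score


def best_variant(addr):
    """Return the highest-scoring stored variant for ADDR, or None if unknown."""
    variants = store.get(addr.casefold())
    if not variants:
        return None
    return max(variants, key=lambda v: variant_score(v, variants[v]))


# Three variants of the same (placeholder) address, the degenerate one last.
add_variant("Karl Fogel <karl@example.org>")
add_variant("K. Fogel <karl@example.org>")
add_variant("HAHAHA <KARL@example.org>")

print(best_variant("karl@example.org"))
```

The point of the sketch is only that the degenerate variant stays in the
store without having to win: a program that wants "most recent only"
behavior can ignore the scores entirely, while one that scores can rank
the degenerate variant last.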

I think you're conflating the storage format with the in-session UI
behavior. Ecomplete can continue to throw away all but the most recent
variant, if it wishes. Other programs can use all of the data and run
it through super fancy machine-learning convoluted neural network AI
bots working in tandem with a crowdsourced social media strategy that
leverages the power of decentralized blockchain advertising affiliate
networks to determine what completions they're going to offer.

But for programs to have this choice, the storage format must hold all
the data that seems obviously relevant (and be extensible, in case
somebody thinks of something later). Then it's up to the programs to
decide what subset of that data they want to use. They don't have to
use all of it.

>If you have too much to complete on, you just end up with noise.

Not really, because scoring allows one to put the right completions
near the top. I rely on this every day now: for the vast majority of
recipient addresses, I only have to type one or two letters and hit
Return, because the choice I wanted is also the one that's scored
highest. Very occasionally I have to type a longer substring -- and in
those cases, being able to type just, say, "lars ing RET" and have the
Right Thing happen is a lovely user experience.

>> I guess we would also switch to UTF-8 for the coding system for the
>> database? (Right now `ecomplete-database-file-coding-system' defaults
>> to `iso-2022-7bit'.)
>
>The latter can store more than the former, but UTF-8 is fine by me.

Thanks. I didn't know that; until now, I didn't realize what ISO-2022
actually is [1]. I tend to lean UTF-8 because it's a widely-supported
standard, e.g., if someone brings up their database file in a buffer or
pages through it with a command-line pager, it'll usually be readable
in both cases.

Best regards,
-Karl

[1] Just looked at
    https://en.wikipedia.org/wiki/ISO/IEC_2022#Comparison_with_other_encodings
    now.