From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Sun, 29 Nov 2020 20:42:15 +0200
Message-ID: <83h7p8knso.fsf@gnu.org>
References: <87blfhjr4q.fsf@gmx.net> <83k0u5mjvf.fsf@gnu.org>
 <877dq5jp51.fsf@gmx.net> <83im9pmh0v.fsf@gnu.org>
 <106736d6-1732-3f24-15c5-af7bcfd688c6@yandex.ru> <83blfhmdho.fsf@gnu.org>
 <247a8edb-7b70-ad32-1ba1-43b5458a82b0@yandex.ru>
 <42ba5cae-e0d7-afd1-9974-62e7ee5840c6@yandex.ru> <83360smbq8.fsf@gnu.org>
 <1142c209-27d4-292c-f087-e0ccb480d893@yandex.ru> <83mtz0krnh.fsf@gnu.org>
In-Reply-To: (message from Dmitry Gutov on Sun, 29 Nov 2020 19:32:17 +0200)
To: Dmitry Gutov
Cc: stephen.berman@gmx.net, emacs-devel@gnu.org

> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> From: Dmitry Gutov
> Date: Sun, 29 Nov 2020 19:32:17 +0200
>
> If the calls to the conversion program are done in parallel to the
> subsequent searches, reading the file twice might not be a problem (with
> the benefit of a disk cache).

How do you mean "in parallel"?  You cannot start searching until you
decide on the encoding, so it cannot be done in parallel.

> >> How does Emacs do it? Does it read until the end of the file?
> >
> > No, just a small initial part of it.  That's one reason why the
> > results are not guaranteed to be correct.
>
> But if we consider that approach good enough for Emacs, it should
> probably be good enough for doing a search from inside Emacs.

It's good enough when the encoding is the locale's codeset, and in a
few other (not very important) cases.  For an arbitrary combination of
the file's encoding and the locale's codeset, the result can be wrong
every single time.  And searching in non-ASCII files whose encoding is
not the locale's native one is precisely the case where this will
fail.  Granted, it's a relatively rare use case, but when it does
happen, all bets are off.

So reading just a small part, as Emacs does, will yield a similar
percentage of wrong guesses.
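
To illustrate the point above about guessing from a small initial part of a file, here is a minimal Python sketch. It is not Emacs's detection code and not the conversion program discussed in this thread: `guess_encoding` is a hypothetical toy heuristic (ASCII, then UTF-8, then a Latin-1 fallback), and the 1024-byte prefix is an arbitrary assumption. It only shows that a guess based on a prefix can differ from one based on the whole file when the non-ASCII bytes appear late.

```python
from typing import Optional


def guess_encoding(data: bytes, prefix_len: Optional[int] = None) -> str:
    """Toy heuristic (hypothetical, not what Emacs does).

    Looks at the first `prefix_len` bytes, or at everything when
    `prefix_len` is None, and returns "ascii", "utf-8" or "latin-1".
    """
    sample = data if prefix_len is None else data[:prefix_len]
    try:
        sample.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Note: a prefix cut in the middle of a multibyte UTF-8 sequence
        # would fail here too; a real detector has to allow for that.
        sample.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Any byte sequence decodes as Latin-1, so use it as the fallback.
        return "latin-1"


# A file that is pure ASCII for 4096 bytes, then has one Latin-1 word.
data = b"a" * 4096 + "caf\u00e9".encode("latin-1")

print(guess_encoding(data, prefix_len=1024))  # guess from the prefix only
print(guess_encoding(data))                   # guess from the whole file
```

With this input the prefix-only guess reports plain ASCII while the whole-file guess falls back to Latin-1, so a search set up from the prefix guess would treat the trailing bytes incorrectly.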