From: Dmitry Gutov
Newsgroups: gmane.emacs.devel
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Sun, 29 Nov 2020 21:48:04 +0200
Message-ID: <284dc041-6b3b-9f9f-aba1-04b7d79d6360@yandex.ru>
In-Reply-To: <83h7p8knso.fsf@gnu.org>
To: Eli Zaretskii
Cc: stephen.berman@gmx.net, emacs-devel@gnu.org

On 29.11.2020 20:42, Eli Zaretskii wrote:
>> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
>> From: Dmitry Gutov
>> Date: Sun, 29 Nov 2020 19:32:17 +0200
>>
>> If the calls to the conversion program are done in parallel to the
>> subsequent searches, reading the file twice might not be a problem (with
>> the benefit of a disk cache).
>
> How do you mean "in parallel"? You cannot start searching until you
> decide on the encoding, so it must not be in parallel.

Since we're passing multiple files to Grep or RG at the same time, it
could start deciding on the encoding of the next file while still
searching the previous one.

>>>> How does Emacs do it? Does it read until the end of the file?
>>>
>>> No, just a small initial part of it. That's one reason why the
>>> results are not guaranteed to be correct.
>>
>> But if we consider that approach good enough for Emacs, it should
>> probably be good enough for doing a search from inside Emacs.
>
> It's good enough when the encoding is the locale's codeset, and in a
> few other (not very important) cases. For an arbitrary combination of
> file's encoding and locale's codeset, the result can be wrong every
> single time.
>
> And searching in non-ASCII files whose encoding is not the locale's
> native one is precisely the case where this will fail. Granted, it's
> a relatively rare use case, but when it does happen, all bets are off.

Which will likely have affected the user (who is foremost an Emacs user)
already, before he/she did the search.

> So reading just a small part, as Emacs does, will yield similar
> percentage of wrong guesses.

...so that seems like a good thing.

Anyway, that should work, but you don't seem to be crazy about the
approach, and I'm not in love with the potential implementation. So
maybe we should stop and let it brew for a little while.
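
To make the "in parallel" idea above concrete, here is a minimal sketch in Python (not Emacs Lisp, and not what Emacs, Grep or RG actually do). It guesses each file's encoding from a small initial part only, and starts guessing for the next file while the current one is still being searched. The names `guess_encoding` and `search_files`, the 4096-byte sample size, and the two-entry candidate list are all assumptions made for illustration.

```python
import codecs
import re
from concurrent.futures import ThreadPoolExecutor

SAMPLE_BYTES = 4096                # assumption: size of the "small initial part"
CANDIDATES = ("utf-8", "latin-1")  # assumption: hypothetical candidate list


def guess_encoding(path, candidates=CANDIDATES, sample_bytes=SAMPLE_BYTES):
    """Guess an encoding by decoding only the first sample_bytes of the file."""
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)()
        try:
            # final=False tolerates a multibyte sequence cut off at the
            # end of the sample.
            decoder.decode(sample, final=False)
            return name
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the last candidate.
    return candidates[-1]


def search_files(paths, pattern):
    """Yield (path, line_number, line) for each matching line.

    The guess for file i+1 is submitted to a worker thread before the
    search of file i begins, so the two overlap.
    """
    regex = re.compile(pattern)
    paths = list(paths)
    if not paths:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(guess_encoding, paths[0])
        for i, path in enumerate(paths):
            encoding = pending.result()
            if i + 1 < len(paths):
                pending = pool.submit(guess_encoding, paths[i + 1])
            # errors="replace" keeps the search going if the guess was wrong.
            with open(path, encoding=encoding, errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if regex.search(line):
                        yield path, lineno, line.rstrip("\n")
```

Because only the sample is inspected, a file whose first 4096 bytes are pure ASCII is reported as UTF-8 even if Latin-1 bytes appear later, which is the kind of wrong guess discussed in the quoted text.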