From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juri Linkov
Newsgroups: gmane.emacs.devel
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Sun, 29 Nov 2020 21:37:23 +0200
Organization: LINKOV.NET
Message-ID: <87tut8zfmk.fsf@mail.linkov.net>
References: <87blfhjr4q.fsf@gmx.net> <83k0u5mjvf.fsf@gnu.org>
 <877dq5jp51.fsf@gmx.net> <83im9pmh0v.fsf@gnu.org>
 <106736d6-1732-3f24-15c5-af7bcfd688c6@yandex.ru> <83blfhmdho.fsf@gnu.org>
 <247a8edb-7b70-ad32-1ba1-43b5458a82b0@yandex.ru>
In-Reply-To: <247a8edb-7b70-ad32-1ba1-43b5458a82b0@yandex.ru> (Dmitry Gutov's
 message of "Sat, 28 Nov 2020 23:04:10 +0200")
Mime-Version: 1.0
Content-Type: text/plain
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (x86_64-pc-linux-gnu)
To: Dmitry Gutov
Cc: Eli Zaretskii, stephen.berman@gmx.net, emacs-devel@gnu.org
Xref: news.gmane.io gmane.emacs.devel:260031

>>>> Adding -a probably cannot do any harm, but its support should be
>>>> detected, since I don't think it's portable enough (it isn't in the
>>>> latest Posix spec, at least).
>>>
>>> Are you sure about that? Are we sure it won't make searching binary
>>> files slower, for example?
>> It will be slower, but more useful: by default Grep just says "Binary
>> file foo matches".
>
> Do we want to search the "binary" files at all? Right now we simply filter
> such matches out (see the definition of xref-matches-in-files), and I have
> seen no complaints.

There are two cases: a really binary file, and a legit ascii file
with an occasional ^@ char.  And grep can't distinguish one from another.
There is an option --binary-files=binary, but unfortunately it doesn't help,
it still outputs "Binary file matches".
So the xref parser needs to be smart enough to detect whether the matched
line contains binary garbage when '-a' is used, or whether it's purely ascii.

Moreover, I think we should apply the same heuristics to the grep output
in grep.el and add '-a' to the grep command by default.  Then grep.el
should prettify the lines with real binary garbage, e.g. by hiding
groups of bytes between 0 and 32, or adding a 'display' property
with an ellipsis.

>>> Also, the manual has this warning:
>>>
>>>   Warning: The -a option might output binary garbage, which can have
>>>   nasty side effects if the output is a terminal and if the terminal
>>>   driver interprets some of it as commands.
>>>
>>> ...which might conceivably mess up our parsing of Grep output sometimes?
>> This is not relevant, since we read that output, there's no terminal
>> device driver to interpret it and get messed up.
>
> Our interpreter is our regexp with which we parse. But I suppose as long as
> Grep doesn't insert unexpected newlines, the parser will be fine.

For grep output a bigger problem is that grep on binary data
might output overly long lines before the terminating newline.

>> I actually don't think I understand why we need -a in this case, since
>> Grep looks for null bytes to decide this is a binary file, and encoded
>> non-ASCII characters don't have null bytes (except if they are in
>> UTF-16).
>
> Good question.

The grep manual says that binary data are either output bytes that are
improperly encoded for the current locale, or null input bytes.
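
To make the heuristics above more concrete, here is a rough sketch.
It is written in Python only for brevity, not in Emacs Lisp, and it is
not code from grep.el or xref.  The function names, the exact set of
control bytes, the "..." replacement and the 200-character cap are all
assumptions chosen for illustration.  `looks_binary' follows the two
conditions from the grep manual (a null byte, or bytes that don't
decode in the locale's encoding, here assumed to be UTF-8), and
`prettify' hides groups of control bytes and cuts overly long lines.

```python
import re

# Runs of control bytes below 32, except tab, newline and carriage
# return, which are normal in text.  Assumption: this byte class is
# illustrative only, not what grep.el or xref actually use.
_CONTROL_RUN = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]+")


def looks_binary(line: bytes, encoding: str = "utf-8") -> bool:
    """Return True if LINE has a null byte or is not valid in ENCODING."""
    if b"\x00" in line:
        return True
    try:
        line.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


def prettify(line: bytes, max_len: int = 200) -> str:
    """Replace each run of control bytes with '...' and cap the length."""
    cleaned = _CONTROL_RUN.sub(b"...", line)
    # Undecodable bytes become U+FFFD instead of raising an error.
    text = cleaned.decode("utf-8", errors="replace")
    if len(text) > max_len:
        text = text[:max_len] + "..."
    return text
```

With this sketch, a latin-1 encoded "caf\xe9" counts as binary under
a UTF-8 encoding, which matches the case in the subject of this thread,
while the same word encoded as UTF-8 does not.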