Re: dired-do-find-regexp failure with latin-1 encoding

unofficial mirror of emacs-devel@gnu.org 

From: Dmitry Gutov <dgutov@yandex.ru>
To: Juri Linkov <juri@linkov.net>
Cc: Eli Zaretskii <eliz@gnu.org>,
	stephen.berman@gmx.net, emacs-devel@gnu.org
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Mon, 30 Nov 2020 03:08:40 +0200
Message-ID: <59a60557-8cfc-fcdc-f0f5-e3e476c56aa1@yandex.ru>
In-Reply-To: <87tut8zfmk.fsf@mail.linkov.net>

On 29.11.2020 21:37, Juri Linkov wrote:

>> Do we want to search the "binary" files at all? Right now we simply filter
>> such matches out (see the definition of xref-matches-in-files), and I have
>> seen no complaints.
> 
> There are two cases: a really binary file, and a legit ascii file
> with an occasional ^@ char.  And grep can't distinguish one from another.
> There is an option --binary-files=binary, but unfortunately it doesn't help,
> it still outputs "Binary file matches".

Makes sense.

> So xref parser needs to be smart enough to detect whether the matched line
> contains binary garbage when '-a' is used, or it's purely ascii.

I guess we can do that, but then some people might be a bit unhappy 
about not being able to search inside such files? It could be useful on 
occasion, too (TBC below *).
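
To make the heuristic under discussion concrete, here is a minimal sketch in Python rather than the Emacs Lisp that xref would actually use. The function name and the set of tolerated whitespace bytes are assumptions made for illustration; nothing like this exists in xref.el. It only flags control bytes, and leaves bytes above 0x7F (e.g. Latin-1 text) to whatever coding system decodes the line.

```python
# Hypothetical helper: decide whether one matched line, as raw bytes,
# contains control-character garbage. Not taken from xref.el.

# Whitespace control bytes that ordinary text lines may contain
# (tab, newline, form feed, carriage return).
TOLERATED = frozenset(b"\t\n\x0c\r")


def has_control_garbage(line: bytes) -> bool:
    """Return True if LINE holds a control byte other than plain whitespace."""
    return any(b < 0x20 and b not in TOLERATED for b in line)
```

With this rule, a line containing a stray ^@ is flagged, while an ASCII or Latin-1 line without control bytes passes through.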

> Moreover, I think we should apply the same heuristics to the grep output
> in grep.el and add '-a' to the grep command by default.

I guess we should. Or do the LC_ALL thing. I'm still unclear on the 
difference in effect between the two.

> Then grep.el
> should prettify the lines with real binary garbage e.g. by hiding groups of
> bytes between 0 and 32, or adding a 'display' property with ellipsis.

Why not. xref could also do something like that.
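
As a rough illustration of the "hide groups of control bytes" idea, the sketch below collapses each run of control bytes into an ellipsis. It is Python, not the 'display'-property approach grep.el would use, and the function name and the choice to keep tab and newline visible are assumptions.

```python
import re

# Runs of bytes below 0x20, excluding tab (0x09) and newline (0x0a).
_CONTROL_RUN = re.compile(rb"[\x00-\x08\x0b-\x1f]+")


def prettify(line: bytes) -> bytes:
    """Replace every run of control bytes in LINE with a three-dot ellipsis."""
    return _CONTROL_RUN.sub(b"...", line)
```

For example, `prettify(b"abc\x00\x01\x02def")` yields `b"abc...def"`, so the match stays readable while the garbage is elided.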

>> Our interpreter is our regexp with which we parse. But I suppose as long as
>> Grep doesn't insert unexpected newlines, the parser will be fine.
> 
> For grep output a bigger problem is that grep on binary data
> might output too long lines before the terminating newline.

(*) We already have this kind of problem with "normal" files which 
contain minified assets (JS or CSS). The file contents are usually 
normal ASCII, but it's just one line which can reach several MBs in length.

The usual way to deal with that is with project-ignores and 
grep-find-ignored-files. That works for both cases.

>>> I actually don't think I understand why we need -a in this case, since
>>> Grep looks for null bytes to decide this is a binary file, and encoded
>>> non-ASCII characters don't have null bytes (except if they are in
>>> UTF-16).
>>
>> Good question.
> 
> The grep manual says that binary data are either output bytes that
> are improperly encoded for the current locale, or null input bytes.

So... if we add LC_ALL=C but not '-a' we will allow the "improperly 
encoded" case but not the "null input bytes" one?
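
The two conditions quoted from the manual can be modelled as a toy predicate. This is a Python sketch of the rule as worded above, not grep's implementation, and it does not settle how grep really behaves. Using "latin-1" as a stand-in for a C locale in which every byte counts as validly encoded is an assumption, as are the function name and the reason labels.

```python
def binary_reasons(data: bytes, encoding: str) -> set:
    """Return which of the two quoted conditions DATA triggers under ENCODING."""
    reasons = set()
    # Condition 1: a null byte anywhere in the input.
    if b"\x00" in data:
        reasons.add("null-byte")
    # Condition 2: bytes that are not valid in the locale's encoding.
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        reasons.add("encoding-error")
    return reasons


latin1_text = "caf\u00e9".encode("latin-1")   # the byte 0xE9 is invalid as UTF-8
with_nul = b"abc\x00def"

utf8_view = binary_reasons(latin1_text, "utf-8")
c_view = binary_reasons(latin1_text, "latin-1")   # stand-in for LC_ALL=C
nul_view = binary_reasons(with_nul, "latin-1")
```

In this model, switching the encoding clears only the "encoding-error" reason; the "null-byte" reason is unaffected by the locale, which matches the reading in the question above.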
