dired-do-find-regexp failure with latin-1 encoding

unofficial mirror of emacs-devel@gnu.org 

* dired-do-find-regexp failure with latin-1 encoding
@ 2020-11-28 18:03 Stephen Berman
  2020-11-28 18:11 ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Stephen Berman @ 2020-11-28 18:03 UTC (permalink / raw)
  To: emacs-devel

My system's language encoding is en_US.UTF-8 but I have many files
encoded as iso-8859-1 (latin-1) and containing a mix of ASCII and
non-ASCII characters.  When I use dired-do-find-regexp on such files,
there are no matches in the *xref* buffer for lines containing both the
search string and a non-ASCII character.  If the file is encoded as
utf-8, then dired-do-find-regexp does find such lines.  Here's a minimal
reproducer:

0. echo aä > /tmp/test
1. emacs -Q /tmp/test ; the file encoding is utf-8
2. Type `C-x d RET', mark the file 'test', type `A a RET'
=> *xref* displays the line 'aä'
3. In buffer 'test' type `C-x RET f iso-8859-1 RET' and then `C-x C-s'
4. Repeat step 2
=> user-error: No matches for: a

dired-do-find-regexp calls xref-matches-in-files and that calls grep,
and that's where the failure happens, so strictly speaking this isn't an
Emacs bug, but it is a problem for users of dired-do-find-regexp
(dired-do-search and occur, for example, don't have this problem).  One
workaround is to add the -a option to the grep invocation in
xref-matches-in-files; then the search succeeds and the *xref* buffer
displays 'a\344'.  But this doesn't work if 'ä' is the search term.  For
the latter, I can get the correct output from grep by piping the output
of 'iconv -f ISO-8859-1 -t UTF-8' through to it, and indeed, prepending
'iconv -f ISO-8859-1 -t UTF-8 | ' to the grep invocation in
xref-matches-in-files does give the correct output in both cases.  But
this won't work if the file has a different non-utf-8 encoding, assuming
the issue isn't specific to latin-1.  Is there another alternative
(aside from "Someone™ can implement it in Emacs Lisp")?

Steve Berman
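
A byte-level view of the reproducer may help explain both symptoms reported above. The following is a minimal Python sketch, not part of the original report; it assumes only that the file holds the line 'aä' and that the search term reaches grep encoded as UTF-8.

```python
# Bytes of the line "aä" under the two encodings from the reproducer.
line = "aä\n"
utf8_bytes = line.encode("utf-8")      # b'a\xc3\xa4\n'
latin1_bytes = line.encode("latin-1")  # b'a\xe4\n'

# In a UTF-8 locale the search term 'ä' is passed along as UTF-8 bytes.
pattern_utf8 = "ä".encode("utf-8")      # b'\xc3\xa4'
pattern_latin1 = "ä".encode("latin-1")  # b'\xe4'

# The UTF-8 pattern occurs in the UTF-8 file but not in the latin-1 file.
found_in_utf8_file = pattern_utf8 in utf8_bytes
found_in_latin1_file = pattern_utf8 in latin1_bytes

# A pattern encoded as latin-1 does occur in the latin-1 file.
found_with_latin1_pattern = pattern_latin1 in latin1_bytes

# The 'a\344' seen in *xref* is the raw latin-1 byte written in octal.
octal_escape = oct(latin1_bytes[1])

print(found_in_utf8_file, found_in_latin1_file,
      found_with_latin1_pattern, octal_escape)
```

This only models the byte comparison; it says nothing about how grep decides whether to print a matching line.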

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 18:03 dired-do-find-regexp failure with latin-1 encoding Stephen Berman
@ 2020-11-28 18:11 ` Eli Zaretskii
  2020-11-28 18:46   ` Stephen Berman
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-28 18:11 UTC (permalink / raw)
  To: Stephen Berman; +Cc: emacs-devel

> From: Stephen Berman <stephen.berman@gmx.net>
> Date: Sat, 28 Nov 2020 19:03:17 +0100
> 
> 0. echo aä > /tmp/test
> 1. emacs -Q /tmp/test ; the file encoding is utf-8
> 2. Type `C-x d RET', mark the file 'test', type `A a RET'
> => *xref* displays the line 'aä'
> 3. In buffer 'test' type `C-x RET f iso-8859-1 RET' and then `C-x C-s'
> 4. Repeat step 2
> => user-error: No matches for: a
> 
> dired-do-find-regexp calls xref-matches-in-files and that calls grep,
> and that's where the failure happens, so strictly speaking this isn't an
> Emacs bug, but it is a problem for users of dired-do-find-regexp
> (dired-do-search and occur, for example, don't have this problem).  One
> workaround is to add the -a option to the grep invocation in
> xref-matches-in-files; then the search succeeds and the *xref* buffer
> displays 'a\344'.  But this doesn't work if 'ä' is the search term.

Does it work for ä if you say

  C-x RET c latin-1 RET A ä RET

?




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 18:11 ` Eli Zaretskii
@ 2020-11-28 18:46   ` Stephen Berman
  2020-11-28 19:13     ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Stephen Berman @ 2020-11-28 18:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On Sat, 28 Nov 2020 20:11:48 +0200 Eli Zaretskii <eliz@gnu.org> wrote:

>> From: Stephen Berman <stephen.berman@gmx.net>
>> Date: Sat, 28 Nov 2020 19:03:17 +0100
>> 
>> 0. echo aä > /tmp/test
>> 1. emacs -Q /tmp/test ; the file encoding is utf-8
>> 2. Type `C-x d RET', mark the file 'test', type `A a RET'
>> => *xref* displays the line 'aä'
>> 3. In buffer 'test' type `C-x RET f iso-8859-1 RET' and then `C-x C-s'
>> 4. Repeat step 2
>> => user-error: No matches for: a
>> 
>> dired-do-find-regexp calls xref-matches-in-files and that calls grep,
>> and that's where the failure happens, so strictly speaking this isn't an
>> Emacs bug, but it is a problem for users of dired-do-find-regexp
>> (dired-do-search and occur, for example, don't have this problem).  One
>> workaround is to add the -a option to the grep invocation in
>> xref-matches-in-files; then the search succeeds and the *xref* buffer
>> displays 'a\344'.  But this doesn't work if 'ä' is the search term.
>
> Does it work for ä if you say
>
>   C-x RET c latin-1 RET A ä RET
>
> ?

Yes (with -a added to the grep invocation, but not without it).  And
then with either 'a' or 'ä' as the search term, *xref* displays 'aä'.
So this seems to be the best workaround, though inconvenient for
frequent uses (but easy enough to wrap a lambda around and bind it to a
key).  Do you then agree to adding -a to the grep invocation in
xref-matches-in-files?  Or could that have undesirable consequences?

Steve Berman




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 18:46   ` Stephen Berman
@ 2020-11-28 19:13     ` Eli Zaretskii
  2020-11-28 19:44       ` Stephen Berman
  2020-11-28 20:16       ` Dmitry Gutov
  0 siblings, 2 replies; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-28 19:13 UTC (permalink / raw)
  To: Stephen Berman; +Cc: emacs-devel

> From: Stephen Berman <stephen.berman@gmx.net>
> Cc: emacs-devel@gnu.org
> Date: Sat, 28 Nov 2020 19:46:18 +0100
> 
> > Does it work for ä if you say
> >
> >   C-x RET c latin-1 RET A ä RET
> >
> > ?
> 
> Yes (with -a added to the grep invocation, but not without it).  And
> then with either 'a' or 'ä' as the search term, *xref* displays 'aä'.
> So this seems to be the best workaround, though inconvenient for
> frequent uses

I really don't see any other way, especially if different files in the
directory have different encodings.  Grep looks for bytes, not
characters, and is agnostic to encoding.  And even if we'd do this in
Emacs Lisp, we'd still need to trust Emacs to guess/detect the correct
encoding of each file.

> Do you then agree to adding -a to the grep invocation in
> xref-matches-in-files?  Or could that have undesirable consequences?

Adding -a probably cannot do any harm, but its support should be
detected, since I don't think it's portable enough (it isn't in the
latest Posix spec, at least).
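
One way to detect support for an option at run time is to probe the program with a trivial input and check its exit status. The sketch below shows that idea in Python; it is not the code used in grep.el, and the function name and its parameters are hypothetical.

```python
import subprocess


def grep_supports_option(option, grep_program="grep"):
    """Return True if GREP_PROGRAM accepts OPTION on a trivial input."""
    try:
        result = subprocess.run(
            [grep_program, option, "x"],
            input=b"x\n",
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Program missing or not runnable: treat the option as unsupported.
        return False
    # Exit status 0 means a line was selected, so the option was accepted.
    return result.returncode == 0


print(grep_supports_option("-a"))
```

The result could be cached so that the probe runs once per session rather than once per search.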


* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 19:13     ` Eli Zaretskii
@ 2020-11-28 19:44       ` Stephen Berman
  2020-11-28 19:49         ` Eli Zaretskii
  2020-11-28 20:16       ` Dmitry Gutov
  1 sibling, 1 reply; 35+ messages in thread
From: Stephen Berman @ 2020-11-28 19:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On Sat, 28 Nov 2020 21:13:20 +0200 Eli Zaretskii <eliz@gnu.org> wrote:

>> From: Stephen Berman <stephen.berman@gmx.net>
>> Cc: emacs-devel@gnu.org
>> Date: Sat, 28 Nov 2020 19:46:18 +0100
>> 
>> > Does it work for ä if you say
>> >
>> >   C-x RET c latin-1 RET A ä RET
>> >
>> > ?
>> 
>> Yes (with -a added to the grep invocation, but not without it).  And
>> then with either 'a' or 'ä' as the search term, *xref* displays 'aä'.
>> So this seems to be the best workaround, though inconvenient for
>> frequent uses
>
> I really don't see any other way, especially if different files in the
> directory have different encodings.

But then the above could not be used for arbitrary marked files in
Dired, right?  (The same goes for the iconv workaround, as I noted.)

>                                      Grep looks for bytes, not
> characters, and is agnostic to encoding.  And even if we'd do this in
> Emacs Lisp, we'd still need to trust Emacs to guess/detect the correct
> encoding of each file.

Don't we usually do that anyway?  And if it guesses wrong, the user can
always make the appropriate change.  And if Emacs can handle each file
differently as required, that's better than either of the above
workarounds (assuming Someone™ implements it).

>> Do you then agree to adding -a to the grep invocation in
>> xref-matches-in-files?  Or could that have undesirable consequences?
>
> Adding -a probably cannot do any harm, but its support should be
> detected, since I don't think it's portable enough (it isn't in the
> latest Posix spec, at least).

Detect it in xref-matches-in-files or somewhere in Lisp and not e.g. in
configure, right?  Is there a canonical way to do that?

Steve Berman




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 19:44       ` Stephen Berman
@ 2020-11-28 19:49         ` Eli Zaretskii
  0 siblings, 0 replies; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-28 19:49 UTC (permalink / raw)
  To: Stephen Berman; +Cc: emacs-devel

> From: Stephen Berman <stephen.berman@gmx.net>
> Cc: emacs-devel@gnu.org
> Date: Sat, 28 Nov 2020 20:44:41 +0100
> 
> > I really don't see any other way, especially if different files in the
> > directory have different encodings.
> 
> But then the above could not be used for arbitrary marked files in
> Dired, right?

Not for arbitrary mixed encodings, no.

> >                                      Grep looks for bytes, not
> > characters, and is agnostic to encoding.  And even if we'd do this in
> > Emacs Lisp, we'd still need to trust Emacs to guess/detect the correct
> > encoding of each file.
> 
> Don't we usually do that anyway?

Do: yes.  Succeed: not necessarily.  Success is only guaranteed if the
encoding is the default locale's encoding; otherwise all bets are off.

> And if it guesses wrong, the user can always make the appropriate
> change.

What would that change be?

> And if Emacs can handle each file differently as required, that's
> better than either of the above workarounds (assuming Someone™
> implements it).

Better, but much slower.

> > Adding -a probably cannot do any harm, but its support should be
> > detected, since I don't think it's portable enough (it isn't in the
> > latest Posix spec, at least).
> 
> Detect it in xref-matches-in-files or somewhere in Lisp and not e.g. in
> configure, right?

Yes.

> Is there a canonical way to do that?

We already do that for some Grep switches, so you should see examples
in grep.el, I think.




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 19:13     ` Eli Zaretskii
  2020-11-28 19:44       ` Stephen Berman
@ 2020-11-28 20:16       ` Dmitry Gutov
  2020-11-28 20:29         ` Eli Zaretskii
  1 sibling, 1 reply; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-28 20:16 UTC (permalink / raw)
  To: Eli Zaretskii, Stephen Berman; +Cc: emacs-devel

On 28.11.2020 21:13, Eli Zaretskii wrote:
>> From: Stephen Berman <stephen.berman@gmx.net>
>> Cc: emacs-devel@gnu.org
>> Date: Sat, 28 Nov 2020 19:46:18 +0100
>>
>>> Does it work for ä if you say
>>>
>>>    C-x RET c latin-1 RET A ä RET
>>>
>>> ?
>>
>> Yes (with -a added to the grep invocation, but not without it).  And
>> then with either 'a' or 'ä' as the search term, *xref* displays 'aä'.
>> So this seems to be the best workaround, though inconvenient for
>> frequent uses
> 
> I really don't see any other way, especially if different files in the
> directory have different encodings.  Grep looks for bytes, not
> characters, and is agnostic to encoding.  And even if we'd do this in
> Emacs Lisp, we'd still need to trust Emacs to guess/detect the correct
> encoding of each file.

Ah, so this way the user explicitly searches for a regexp encoded as 
latin-1?

>> Do you then agree to adding -a to the grep invocation in
>> xref-matches-in-files?  Or could that have undesirable consequences?
> 
> Adding -a probably cannot do any harm, but its support should be
> detected, since I don't think it's portable enough (it isn't in the
> latest Posix spec, at least).

Are you sure about that? Are we sure it won't make searching binary 
files slower, for example?

Also, the manual has this warning:

Warning: The -a option might output  binary  garbage,  which  can  have 
nasty  side effects if the output is a terminal and if the terminal 
driver interprets some of it as commands.

...which might conceivably mess up our parsing of Grep output sometimes?

P.S. Or we can forgo all that and ask the users who want to search for 
non-ASCII strings to install ripgrep. I've posted a patch which adds its 
support a couple of months ago, and I fully intend to resurrect it 
(mostly for performance reasons, though).


* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 20:16       ` Dmitry Gutov
@ 2020-11-28 20:29         ` Eli Zaretskii
  2020-11-28 21:04           ` Dmitry Gutov
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-28 20:29 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> Cc: emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sat, 28 Nov 2020 22:16:21 +0200
> 
> >>>    C-x RET c latin-1 RET A ä RET
> >>>
> >>> ?
> >>
> >> Yes (with -a added to the grep invocation, but not without it).  And
> >> then with either 'a' or 'ä' as the search term, *xref* displays 'aä'.
> >> So this seems to be the best workaround, though inconvenient for
> >> frequent uses
> > 
> > I really don't see any other way, especially if different files in the
> > directory have different encodings.  Grep looks for bytes, not
> > characters, and is agnostic to encoding.  And even if we'd do this in
> > Emacs Lisp, we'd still need to trust Emacs to guess/detect the correct
> > encoding of each file.
> 
> Ah, so this way the user explicitly searches for a regexp encoded as 
> latin-1?

More accurately, this is how to search in files encoded in Latin-1.
(The regexp also gets encoded in latin-1, but the important part is
the files' encoding.)

> > Adding -a probably cannot do any harm, but its support should be
> > detected, since I don't think it's portable enough (it isn't in the
> > latest Posix spec, at least).
> 
> Are you sure about that? Are we sure it won't make searching binary 
> files slower, for example?

It will be slower, but more useful: by default Grep just says "Binary
file foo matches".

> Also, the manual has this warning:
> 
> Warning: The -a option might output  binary  garbage,  which  can  have 
> nasty  side effects if the output is a terminal and if the terminal 
> driver interprets some of it as commands.
> 
> ...which might conceivably mess up our parsing of Grep output sometimes?

This is not relevant, since we read that output, there's no terminal
device driver to interpret it and get messed up.

I actually don't think I understand why we need -a in this case, since
Grep looks for null bytes to decide this is a binary file, and encoded
non-ASCII characters don't have null bytes (except if they are in
UTF-16).

> P.S. Or we can forgo all that and ask the users who want to search for 
> non-ASCII strings to install ripgrep.

We should support Grep regardless, since not everyone will have
ripgrep.  And in any case, "C-x RET c" will be needed with it as well,
no?




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 20:29         ` Eli Zaretskii
@ 2020-11-28 21:04           ` Dmitry Gutov
  2020-11-29  0:49             ` Dmitry Gutov
                               ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-28 21:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 28.11.2020 22:29, Eli Zaretskii wrote:

>> Ah, so this way the user explicitly searches for a regexp encoded as
>> latin-1?
> 
> More accurately, this is how to search in files encoded in Latin-1.
> (The regexp also gets encoded in latin-1, but the important part is
> the files' encoding.)

Right. So when there are files in different encodings, the result will 
be not great, as expected.

>>> Adding -a probably cannot do any harm, but its support should be
>>> detected, since I don't think it's portable enough (it isn't in the
>>> latest Posix spec, at least).
>>
>> Are you sure about that? Are we sure it won't make searching binary
>> files slower, for example?
> 
> It will be slower, but more useful: by default Grep just says "Binary
> file foo matches".

Do we want to search the "binary" files at all? Right now we simply 
filter such matches out (see the definition of xref-matches-in-files), 
and I have seen no complaints.

>> Also, the manual has this warning:
>>
>> Warning: The -a option might output  binary  garbage,  which  can  have
>> nasty  side effects if the output is a terminal and if the terminal
>> driver interprets some of it as commands.
>>
>> ...which might conceivably mess up our parsing of Grep output sometimes?
> 
> This is not relevant, since we read that output, there's no terminal
> device driver to interpret it and get messed up.

Our interpreter is our regexp with which we parse. But I suppose as long 
as Grep doesn't insert unexpected newlines, the parser will be fine.

> I actually don't think I understand why we need -a in this case, since
> Grep looks for null bytes to decide this is a binary file, and encoded
> non-ASCII characters don't have null bytes (except if they are in
> UTF-16).

Good question.

>> P.S. Or we can forgo all that and ask the users who want to search for
>> non-ASCII strings to install ripgrep.
> 
> We should support Grep regardless, since not everyone will have
> ripgrep.  And in any case, "C-x RET c" will be needed with it as well,
> no?

I'd have to test it explicitly to say for sure, but:

   ripgrep supports searching files in text encodings other than UTF-8,
   such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some
   support for automatically detecting UTF-16 is provided. Other text
   encodings must be specifically specified with the -E/--encoding flag.)

https://blog.burntsushi.net/ripgrep/#pitch

So if the file encoding is UTF-8, UTF-16, or latin-1 (AND the current 
system locale matches that encoding), the search should work fine across 
such files in different encodings, and without 'C-x RET c'.

Which doesn't cover all situations, of course, but it's about as much as 
can be expected. And more than Grep can.


* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 21:04           ` Dmitry Gutov
@ 2020-11-29  0:49             ` Dmitry Gutov
  2020-11-29 15:19               ` Eli Zaretskii
  2020-11-29 15:06             ` Eli Zaretskii
  2020-11-29 19:37             ` Juri Linkov
  2 siblings, 1 reply; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-29  0:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 28.11.2020 23:04, Dmitry Gutov wrote:
> or latin-1 (AND the current system locale matches that encoding), the 
> search should work fine across such files in different encodings, and 
> without 'C-x RET c'

Correction: only utf-8 and utf-16 detection is automatic. latin-1 needs 
explicit arguments '-E latin-1' passed to rg.

The official recommended workaround is to use a --pre flag which is 
similar to what Stephen did originally by inserting 'iconv ...' in the 
shell command string: https://github.com/BurntSushi/ripgrep/issues/746

I suppose if we really wanted, we could insert some custom program that 
chooses what to 'iconv' with, but that would be slower, of course. But 
it could work with Grep, too.




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 21:04           ` Dmitry Gutov
  2020-11-29  0:49             ` Dmitry Gutov
@ 2020-11-29 15:06             ` Eli Zaretskii
  2020-11-29 15:14               ` Yuri Khan
  2020-11-29 16:07               ` Dmitry Gutov
  2020-11-29 19:37             ` Juri Linkov
  2 siblings, 2 replies; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 15:06 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sat, 28 Nov 2020 23:04:10 +0200
> 
> >> Are you sure about that? Are we sure it won't make searching binary
> >> files slower, for example?
> > 
> > It will be slower, but more useful: by default Grep just says "Binary
> > file foo matches".
> 
> Do we want to search the "binary" files at all?

We don't.  I still hope to understand why -a was needed in this case.
Stephen?

> > We should support Grep regardless, since not everyone will have
> > ripgrep.  And in any case, "C-x RET c" will be needed with it as well,
> > no?
> 
> I'd have to test it explicitly to say for sure, but:
> 
>    ripgrep supports searching files in text encodings other than UTF-8,
>    such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some
>    support for automatically detecting UTF-16 is provided. Other text
>    encodings must be specifically specified with the -E/--encoding flag.)
> 
> https://blog.burntsushi.net/ripgrep/#pitch

What is not clear to me is whether the _output_ is always in some
fixed encoding, like UTF-8.  That doesn't seem to be stated in the
docs there.




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 15:06             ` Eli Zaretskii
@ 2020-11-29 15:14               ` Yuri Khan
  2020-11-29 15:36                 ` Stephen Berman
  2020-11-29 15:50                 ` Eli Zaretskii
  2020-11-29 16:07               ` Dmitry Gutov
  1 sibling, 2 replies; 35+ messages in thread
From: Yuri Khan @ 2020-11-29 15:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stephen Berman, Emacs developers, Dmitry Gutov

On Sun, 29 Nov 2020 at 22:07, Eli Zaretskii <eliz@gnu.org> wrote:

> We don't.  I still hope to understand why -a was needed in this case.

The grep manual says it considers files to be binary if it encounters
byte sequences that are not valid text encoded in the locale’s
encoding.
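
The rule described above can be illustrated with a simplified model. This Python sketch is not GNU grep's implementation; it assumes a UTF-8 locale and treats data as binary if it contains a NUL byte or a byte sequence that is not valid UTF-8. The function name is hypothetical.

```python
def looks_binary_in_utf8_locale(data: bytes) -> bool:
    """Simplified model: binary if DATA has a NUL byte or is not valid UTF-8."""
    if b"\x00" in data:
        return True
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


utf8_file = "aä\n".encode("utf-8")
latin1_file = "aä\n".encode("latin-1")

print(looks_binary_in_utf8_locale(utf8_file))
print(looks_binary_in_utf8_locale(latin1_file))
```

Under this model the latin-1 file counts as binary even though it has no NUL bytes, because the lone byte 0xE4 followed by a newline is not a valid UTF-8 sequence.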




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29  0:49             ` Dmitry Gutov
@ 2020-11-29 15:19               ` Eli Zaretskii
  2020-11-29 16:27                 ` Dmitry Gutov
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 15:19 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> From: Dmitry Gutov <dgutov@yandex.ru>
> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> Date: Sun, 29 Nov 2020 02:49:25 +0200
> 
> On 28.11.2020 23:04, Dmitry Gutov wrote:
> > or latin-1 (AND the current system locale matches that encoding), the 
> > search should work fine across such files in different encodings, and 
> > without 'C-x RET c'
> 
> Correction: only utf-8 and utf-16 detection is automatic. latin-1 needs 
> explicit arguments '-E latin-1' passed to rg.
> 
> The official recommended workaround is to use a --pre flag which is 
> similar to what Stephen did originally by inserting 'iconv ...' in the 
> shell command string: https://github.com/BurntSushi/ripgrep/issues/746

How can --pre help?  It still cannot easily support different
encodings in the same command, right?

> I suppose if we really wanted, we could insert some custom program that 
> chooses what to 'iconv' with, but that would be slower, of course. But 
> it could work with Grep, too.

It would be brittle, unless that program actually reads the entire
file (which will be slow).




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 15:14               ` Yuri Khan
@ 2020-11-29 15:36                 ` Stephen Berman
  2020-11-29 15:50                 ` Eli Zaretskii
  1 sibling, 0 replies; 35+ messages in thread
From: Stephen Berman @ 2020-11-29 15:36 UTC (permalink / raw)
  To: Yuri Khan; +Cc: Eli Zaretskii, Emacs developers, Dmitry Gutov

On Sun, 29 Nov 2020 22:14:51 +0700 Yuri Khan <yuri.v.khan@gmail.com> wrote:

> On Sun, 29 Nov 2020 at 22:07, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> We don't.  I still hope to understand why -a was needed in this case.
>
> The grep manual says it considers files to be binary if it encounters
> byte sequences that are not valid text encoded in the locale’s
> encoding.

I guess that's the reason.  The files in question are xhtml files, each
beginning with <?xml version="1.0" encoding="iso-8859-1"?>.  Emacs
displays them in XHTML+ mode.  As I noted, my locale is en_US.UTF-8.  I
guess grep doesn't grok iso-8859-1 in that locale, but with -a can at
least find ascii matches.

Steve Berman




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 15:14               ` Yuri Khan
  2020-11-29 15:36                 ` Stephen Berman
@ 2020-11-29 15:50                 ` Eli Zaretskii
  1 sibling, 0 replies; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 15:50 UTC (permalink / raw)
  To: Yuri Khan; +Cc: stephen.berman, emacs-devel, dgutov

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Sun, 29 Nov 2020 22:14:51 +0700
> Cc: Dmitry Gutov <dgutov@yandex.ru>, Stephen Berman <stephen.berman@gmx.net>, 
> 	Emacs developers <emacs-devel@gnu.org>
> 
> On Sun, 29 Nov 2020 at 22:07, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > We don't.  I still hope to understand why -a was needed in this case.
> 
> The grep manual says it considers files to be binary if it encounters
> byte sequences that are not valid text encoded in the locale’s
> encoding.

Is that what the Grep source says?




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 15:06             ` Eli Zaretskii
  2020-11-29 15:14               ` Yuri Khan
@ 2020-11-29 16:07               ` Dmitry Gutov
  2020-11-29 17:12                 ` Eli Zaretskii
  1 sibling, 1 reply; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-29 16:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 29.11.2020 17:06, Eli Zaretskii wrote:

>> Do we want to search the "binary" files at all?
> 
> We don't.  I still hope to understand why -a was needed in this case.
> Stephen?

Looks like it actually depends on the encoding of the _output_. So if it 
can print some lines well but not others it can even print a line from a 
file and then later say it's a binary:

$ grep "prem" latin1.txt
premie?re is slightly different
Binary file latin1.txt matches

Adding -a or prepending 'LC_ALL=C' changes that:
$ LC_ALL=C grep "prem" latin1.txt
premi�re is first
premie?re is slightly different

So... looks like Grep searches through all files anyway. Just modifies 
its output in cases where it looks iffy.

>>> We should support Grep regardless, since not everyone will have
>>> ripgrep.  And in any case, "C-x RET c" will be needed with it as well,
>>> no?
>>
>> I'd have to test it explicitly to say for sure, but:
>>
>>     ripgrep supports searching files in text encodings other than UTF-8,
>>     such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some
>>     support for automatically detecting UTF-16 is provided. Other text
>>     encodings must be specifically specified with the -E/--encoding flag.)
>>
>> https://blog.burntsushi.net/ripgrep/#pitch
> 
> What is not clear to me is whether the _output_ is always in some
> fixed encoding, like UTF-8.  That doesn't seem to be stated in the
> docs there.

Judging by a small experiment, rg's output is in the same encoding as 
input, for each file. Which can be a nuisance when looking at the search 
results, but that's probably all.

In any case, if one takes the pre-processing route, the end encoding 
will be UTF-8.




* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 15:19               ` Eli Zaretskii
@ 2020-11-29 16:27                 ` Dmitry Gutov
  2020-11-29 17:18                   ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-29 16:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 29.11.2020 17:19, Eli Zaretskii wrote:
>> From: Dmitry Gutov <dgutov@yandex.ru>
>> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
>> Date: Sun, 29 Nov 2020 02:49:25 +0200
>>
>> On 28.11.2020 23:04, Dmitry Gutov wrote:
>>> or latin-1 (AND the current system locale matches that encoding), the
>>> search should work fine across such files in different encodings, and
>>> without 'C-x RET c'
>>
>> Correction: only utf-8 and utf-16 detection is automatic. latin-1 needs
>> explicit arguments '-E latin-1' passed to rg.
>>
>> The official recommended workaround is to use a --pre flag which is
>> similar to what Stephen did originally by inserting 'iconv ...' in the
>> shell command string: https://github.com/BurntSushi/ripgrep/issues/746
> 
> How can --pre help?  It still cannot easily support different
> encodings in the same command, right?

It can help by calling iconv with different arguments depending on the 
contents of each file. Which is valuable, I think, because we're 
normally not piping file contents to grep (or, potentially, rg), instead 
we pass multiple file names to it using xargs.

That wouldn't be easy, but some script that performs conversion based on 
file contents could work.

>> I suppose if we really wanted, we could insert some custom program that
>> chooses what to 'iconv' with, but that would be slower, of course. But
>> it could work with Grep, too.
> 
> It would be brittle, unless that program actually reads the entire
> file (which will be slow).

How does Emacs do it? Does it read until the end of the file? If not, we 
could try to reuse some of its logic.

Otherwise, yes, our options are either slow or brittle. That might be 
why ripgrep's author decided to offload this responsibility, looking at 
the discussion referenced above.

In any case, --pre will already become significantly slower than the 
current behavior (it will spawn a process for each searched file), so we 
might afford the "slow" approach here because we won't enable it by 
default anyway.
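
The pre-processing idea discussed here can be sketched as a per-file filter that re-encodes its input as UTF-8 before the search. The Python below is a toy under stated assumptions: it tries UTF-8 first and falls back to latin-1, which is a hypothetical heuristic and not what ripgrep, iconv, or Emacs do; the function names are invented for illustration.

```python
def to_utf8(data: bytes, fallbacks=("latin-1",)) -> bytes:
    """Re-encode DATA as UTF-8, guessing the source encoding."""
    try:
        data.decode("utf-8")
        return data  # Already valid UTF-8; pass through unchanged.
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return data.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # Nothing fit: keep going, replacing undecodable bytes.
    return data.decode("utf-8", errors="replace").encode("utf-8")


def search(files: dict, term: str) -> list:
    """Return (name, line) pairs whose converted text contains TERM."""
    hits = []
    for name, data in files.items():
        text = to_utf8(data).decode("utf-8")
        for line in text.splitlines():
            if term in line:
                hits.append((name, line))
    return hits


files = {
    "utf8.txt": "aä\n".encode("utf-8"),
    "latin1.txt": "aä\n".encode("latin-1"),
}
print(search(files, "ä"))
```

Because latin-1 assigns a character to every byte value, this fallback never fails, so it will silently mis-decode files in any other 8-bit encoding. That is the brittleness the thread is concerned about.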


* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 16:07               ` Dmitry Gutov
@ 2020-11-29 17:12                 ` Eli Zaretskii
  2020-11-29 17:19                   ` Dmitry Gutov
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 17:12 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sun, 29 Nov 2020 18:07:38 +0200
> 
> Adding -a or prepending 'LC_ALL=C' changes that:
> $ LC_ALL=C grep "prem" latin1.txt
> premi�re is first
> premie?re is slightly different

Is that � what Grep actually produced?

> > What is not clear to me is whether the _output_ is always in some
> > fixed encoding, like UTF-8.  That doesn't seem to be stated in the
> > docs there.
> 
> Judging by a small experiment, rg's output is in the same encoding as 
> input, for each file.

So in this aspect it is not better than Grep: it is still impractical
to search through files that have different encodings.

> In any case, if one takes the pre-processing route, the end encoding 
> will be UTF-8.

But then the pre-processor will have to guess the encoding (if it is
not the same for all the files), which we know is not simple.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 16:27                 ` Dmitry Gutov
@ 2020-11-29 17:18                   ` Eli Zaretskii
  2020-11-29 17:32                     ` Dmitry Gutov
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 17:18 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sun, 29 Nov 2020 18:27:24 +0200
> 
> > How can --pre help?  It still cannot easily support different
> > encodings in the same command, right?
> 
> It can help by calling iconv with different arguments depending on the 
> contents of each file. Which is valuable, I think, because we're 
> normally not piping file contents to grep (or, potentially, rg), instead 
> we pass multiple file names to it using xargs.
> 
> That wouldn't be easy, but some script that performs conversion based on 
> file contents could work.

It could work in principle, but I think in practice it will not be
faster than doing everything in Emacs Lisp, because each file will
need to be read twice.

> > It would be brittle, unless that program actually reads the entire
> > file (which will be slow).
> 
> How does Emacs do it? Does it read until the end of the file?

No, just a small initial part of it.  That's one reason why the
results are not guaranteed to be correct.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 17:12                 ` Eli Zaretskii
@ 2020-11-29 17:19                   ` Dmitry Gutov
  2020-11-29 17:25                     ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-29 17:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 29.11.2020 19:12, Eli Zaretskii wrote:
>> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
>> From: Dmitry Gutov <dgutov@yandex.ru>
>> Date: Sun, 29 Nov 2020 18:07:38 +0200
>>
>> Adding -a or prepending 'LC_ALL=C' changes that:
>> $ LC_ALL=C grep "prem" latin1.txt
>> premi�re is first
>> premie?re is slightly different
> 
> Is that � what Grep actually produced?

That's copied from a terminal emulator.

If I run it with shell-command, I get this:

premi\350re is first
premie?re is slightly different

(\350 being a raw char)

>>> What is not clear to me is whether the _output_ is always in some
>>> fixed encoding, like UTF-8.  That doesn't seem to be stated in the
>>> docs there.
>>
>> Judging by a small experiment, rg's output is in the same encoding as
>> input, for each file.
> 
> So in this aspect it is not better than Grep: it is still impractical
> to search through files that have different encodings.

It's not optimal, but the important thing is to get matches from all of 
them. Even if some can be printed in a not-so-readable way.

>> In any case, if one takes the pre-processing route, the end encoding
>> will be UTF-8.
> 
> But then the pre-processor will have to guess the encoding (if it is
> not the same for all the files), which we know is not simple.

Yes.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 17:19                   ` Dmitry Gutov
@ 2020-11-29 17:25                     ` Eli Zaretskii
  2020-11-29 17:44                       ` Dmitry Gutov
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 17:25 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sun, 29 Nov 2020 19:19:43 +0200
> 
> > Is that � what Grep actually produced?
> 
> That's copied from a terminal emulator.
> 
> If I run it with shell-command, I get this:
> 
> premi\350re is first
> premie?re is slightly different
> 
> (\350 being a raw char)

Then I think injecting LC_ALL=C into the environment when running Grep
in this case makes the results more useful?  And we can then avoid
using -a?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 17:18                   ` Eli Zaretskii
@ 2020-11-29 17:32                     ` Dmitry Gutov
  2020-11-29 18:42                       ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-29 17:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 29.11.2020 19:18, Eli Zaretskii wrote:

>> That wouldn't be easy, but some script that performs conversion based on
>> file contents could work.
> 
> It could work in principle, but I think in practice it will not be
> faster than doing everything in Emacs Lisp, because each file will
> need to be read twice.

It will certainly be faster if the host is remote.

On a local machine, you might be right, but we'd have to benchmark to be 
sure.

If the calls to the conversion program are done in parallel to the 
subsequent searches, reading the file twice might not be a problem (with 
the benefit of a disk cache). And if rg itself performs the search 
faster than Emacs' regexp engine, that can also be a factor.

Depends on process spawning overhead, I suppose.

>>> It would be brittle, unless that program actually reads the entire
>>> file (which will be slow).
>>
>> How does Emacs do it? Does it read until the end of the file?
> 
> No, just a small initial part of it.  That's one reason why the
> results are not guaranteed to be correct.

But if we consider that approach good enough for Emacs, it should 
probably be good enough for doing a search from inside Emacs.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 17:25                     ` Eli Zaretskii
@ 2020-11-29 17:44                       ` Dmitry Gutov
  2020-11-29 18:51                         ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-29 17:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 29.11.2020 19:25, Eli Zaretskii wrote:
>> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
>> From: Dmitry Gutov <dgutov@yandex.ru>
>> Date: Sun, 29 Nov 2020 19:19:43 +0200
>>
>>> Is that � what Grep actually produced?
>>
>> That's copied from a terminal emulator.
>>
>> If I run it with shell-command, I get this:
>>
>> premi\350re is first
>> premie?re is slightly different
>>
>> (\350 being a raw char)
> 
> Then I think injecting LC_ALL=C into the environment when running Grep
> in this case makes the results more useful?  And we can then avoid
> using -a?

I'm not so sure. LC_ALL=C seems more problematic than -a:

$ grep ф test.txt
фыва
$ grep -a ф test.txt
фыва
$ LC_ALL=C grep ф test.txt
(nothing)

Curiously,

   LC_ALL=C grep première latin1.txt

works just fine with my terminal emulator, but that's probably because it 
decodes the multibyte search string under the covers before using it as 
argument. It doesn't work in Emacs without 'C-x RET c'.
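
The underlying constraint is that the bytes of the search string have to
match the bytes stored in the file. A minimal sketch of that idea,
assuming the file is known to be ISO-8859-1 and the pattern arrives as
UTF-8; the function name grep_latin1 is hypothetical:

```shell
# Hypothetical helper: search a Latin-1 FILE for a PATTERN given in UTF-8.
grep_latin1() {
    pattern_utf8=$1
    file=$2
    # Re-encode the pattern so its bytes match the bytes stored in the file.
    # Fails (status 2) if the pattern has no Latin-1 representation.
    pattern_latin1=$(printf '%s' "$pattern_utf8" |
                         iconv -f UTF-8 -t ISO-8859-1) || return 2
    # LC_ALL=C makes grep work on single bytes; -a stops it from reporting
    # "Binary file ... matches".  Matching lines are converted back to
    # UTF-8 for display.
    LC_ALL=C grep -a -- "$pattern_latin1" "$file" |
        iconv -f ISO-8859-1 -t UTF-8
}
```

Called as `grep_latin1 première latin1.txt` from a UTF-8 shell, this
should print the matching line in UTF-8. It does nothing for files in
other encodings, so it only illustrates the byte-matching requirement.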



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 17:32                     ` Dmitry Gutov
@ 2020-11-29 18:42                       ` Eli Zaretskii
  2020-11-29 19:48                         ` Dmitry Gutov
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 18:42 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sun, 29 Nov 2020 19:32:17 +0200
> 
> If the calls to the conversion program are done in parallel to the 
> subsequent searches, reading the file twice might not be a problem (with 
> the benefit of a disk cache).

How do you mean "in parallel"?  You cannot start searching until you
decide on the encoding, so it must not be in parallel.

> >> How does Emacs do it? Does it read until the end of the file?
> > 
> > No, just a small initial part of it.  That's one reason why the
> > results are not guaranteed to be correct.
> 
> But if we consider that approach good enough for Emacs, it should 
> probably be good enough for doing a search from inside Emacs.

It's good enough when the encoding is the locale's codeset, and in a
few other (not very important) cases.  For an arbitrary combination of
file's encoding and locale's codeset, the result can be wrong every
single time.

And searching in non-ASCII files whose encoding is not the locale's
native one is precisely the case where this will fail.  Granted, it's
a relatively rare use case, but when it does happen, all bets are off.

So reading just a small part, as Emacs does, will yield similar
percentage of wrong guesses.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 17:44                       ` Dmitry Gutov
@ 2020-11-29 18:51                         ` Eli Zaretskii
  2020-11-29 19:07                           ` Dmitry Gutov
  2020-11-29 19:49                           ` Gregory Heytings via Emacs development discussions.
  0 siblings, 2 replies; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 18:51 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sun, 29 Nov 2020 19:44:57 +0200
> 
> > Then I think injecting LC_ALL=C into the environment when running Grep
> > in this case makes the results more useful?  And we can then avoid
> > using -a?
> 
> I'm not so sure. LC_ALL=C seems more problematic than -a:
> 
> $ grep ф test.txt
> фыва
> $ grep -a ф test.txt
> фыва
> $ LC_ALL=C grep ф test.txt
> (nothing)

I guess this regression in Grep happened when they "internationalized"
the DFA code, sigh...

It almost sounds like we should develop our own replacement for Grep,
one that doesn't suffer from these problems.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 18:51                         ` Eli Zaretskii
@ 2020-11-29 19:07                           ` Dmitry Gutov
  2020-11-29 19:32                             ` Eli Zaretskii
  2020-11-29 19:49                             ` Stephen Berman
  2020-11-29 19:49                           ` Gregory Heytings via Emacs development discussions.
  1 sibling, 2 replies; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-29 19:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 29.11.2020 20:51, Eli Zaretskii wrote:
>> I'm not so sure. LC_ALL=C seems more problematic than -a:
>>
>> $ grep ф test.txt
>> фыва
>> $ grep -a ф test.txt
>> фыва
>> $ LC_ALL=C grep ф test.txt
>> (nothing)
> I guess this regression in Grep happened when they "internationalized"
> the DFA code, sigh...

Sorry, I double-checked, and it seems to have been caused by my terminal 
emulator too: if I set LC_ALL in Emacs and do a search through 
shell-command or dired-do-find-regexp, it succeeds.

You might want to verify this yourself, though.

> It almost sounds like we should develop our own replacement for Grep,
> one that doesn't suffer from these problems.

If we were going to bundle a new tool, we could pick some existing one. 
Perhaps one that has already been mentioned in this conversation ;-)



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 19:07                           ` Dmitry Gutov
@ 2020-11-29 19:32                             ` Eli Zaretskii
  2020-11-29 19:34                               ` Eli Zaretskii
  2020-11-29 19:49                             ` Stephen Berman
  1 sibling, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 19:32 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: stephen.berman, emacs-devel

> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Sun, 29 Nov 2020 21:07:49 +0200
> 
> > It almost sounds like we should develop our own replacement for Grep,
> > one that doesn't suffer from these problems.
> 
> If we were going to bundle a new tool, we could pick some existing one. 
> Perhaps one that has already been mentioned in this conversation ;-)

The internals should work like Emacs, not like Grep, though.  That is,
instead of relying on libc, which uses functions sensitive to the
locale, it should use locale-independent regexp search we have in
Emacs.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 19:32                             ` Eli Zaretskii
@ 2020-11-29 19:34                               ` Eli Zaretskii
  0 siblings, 0 replies; 35+ messages in thread
From: Eli Zaretskii @ 2020-11-29 19:34 UTC (permalink / raw)
  To: dgutov; +Cc: stephen.berman, emacs-devel

> Date: Sun, 29 Nov 2020 21:32:40 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
> 
> The internals should work like Emacs, not like Grep, though.  That is,
> instead of relying on libc, which uses functions sensitive to the
> locale, it should use locale-independent regexp search we have in
> Emacs.

And the output should be in UTF-8 regardless of the input encoding.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-28 21:04           ` Dmitry Gutov
  2020-11-29  0:49             ` Dmitry Gutov
  2020-11-29 15:06             ` Eli Zaretskii
@ 2020-11-29 19:37             ` Juri Linkov
  2020-11-30  1:08               ` Dmitry Gutov
  2 siblings, 1 reply; 35+ messages in thread
From: Juri Linkov @ 2020-11-29 19:37 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Eli Zaretskii, stephen.berman, emacs-devel

>>>> Adding -a probably cannot do any harm, but its support should be
>>>> detected, since I don't think it's portable enough (it isn't in the
>>>> latest Posix spec, at least).
>>>
>>> Are you sure about that? Are we sure it won't make searching binary
>>> files slower, for example?
>> It will be slower, but more useful: by default Grep just says "Binary
>> file foo matches".
>
> Do we want to search the "binary" files at all? Right now we simply filter
> such matches out (see the definition of xref-matches-in-files), and I have
> seen no complaints.

There are two cases: a really binary file, and a legit ascii file
with an occasional ^@ char.  And grep can't distinguish one from another.
There is an option --binary-files=binary, but unfortunately it doesn't help,
it still outputs "Binary file matches".

So xref parser needs to be smart enough to detect whether the matched line
contains binary garbage when '-a' is used, or it's purely ascii.

Moreover, I think we should apply the same heuristics to the grep output
in grep.el and add '-a' to the grep command by default.  Then grep.el
should prettify the lines with real binary garbage e.g. by hiding groups of
bytes between 0 and 32, or adding a 'display' property with ellipsis.

>>> Also, the manual has this warning:
>>>
>>> Warning: The -a option might output  binary  garbage,  which  can  have
>>> nasty  side effects if the output is a terminal and if the terminal
>>> driver interprets some of it as commands.
>>>
>>> ...which might conceivably mess up our parsing of Grep output sometimes?
>> This is not relevant, since we read that output, there's no terminal
>> device driver to interpret it and get messed up.
>
> Our interpreter is our regexp with which we parse. But I suppose as long as
> Grep doesn't insert unexpected newlines, the parser will be fine.

For grep output a bigger problem is that grep on binary data
might output too long lines before the terminating newline.

>> I actually don't think I understand why we need -a in this case, since
>> Grep looks for null bytes to decide this is a binary file, and encoded
>> non-ASCII characters don't have null bytes (except if they are in
>> UTF-16).
>
> Good question.

The grep manual says that binary data are either output bytes that
are improperly encoded for the current locale, or null input bytes.
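
Those two conditions can be checked separately. A rough sketch, assuming
UTF-8 is the expected encoding and using hypothetical names; this is not
what grep does internally, only an approximation of the two cases the
manual describes:

```shell
# Hypothetical classifier for FILE.  Prints one of:
#   nul    - contains at least one null byte
#   utf8   - no null bytes and valid UTF-8 (plain ASCII included)
#   other  - no null bytes, but not valid UTF-8 (e.g. Latin-1 text)
classify_file() {
    file=$1
    # Compare the byte count with and without null bytes.
    total=$(wc -c < "$file")
    without_nul=$(tr -d '\000' < "$file" | wc -c)
    if [ "$total" -ne "$without_nul" ]; then
        echo nul
    elif iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
        echo utf8
    else
        echo other
    fi
}
```

Files reported as "other" are the ones where -a (or a different locale)
is likely to give readable matches; files reported as "nul" are the
ones where the output may contain real binary garbage.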

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 18:42                       ` Eli Zaretskii
@ 2020-11-29 19:48                         ` Dmitry Gutov
  0 siblings, 0 replies; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-29 19:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen.berman, emacs-devel

On 29.11.2020 20:42, Eli Zaretskii wrote:
>> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
>> From: Dmitry Gutov <dgutov@yandex.ru>
>> Date: Sun, 29 Nov 2020 19:32:17 +0200
>>
>> If the calls to the conversion program are done in parallel to the
>> subsequent searches, reading the file twice might not be a problem (with
>> the benefit of a disk cache).
> 
> How do you mean "in parallel"?  You cannot start searching until you
> decide on the encoding, so it must not be in parallel.

Since we're passing multiple files to Grep or RG at the same time, it 
could start deciding on the encoding of the next file while still 
searching the previous one.

>>>> How does Emacs do it? Does it read until the end of the file?
>>>
>>> No, just a small initial part of it.  That's one reason why the
>>> results are not guaranteed to be correct.
>>
>> But if we consider that approach good enough for Emacs, it should
>> probably be good enough for doing a search from inside Emacs.
> 
> It's good enough when the encoding is the locale's codeset, and in a
> few other (not very important) cases.  For an arbitrary combination of
> file's encoding and locale's codeset, the result can be wrong every
> single time.
> 
> And searching in non-ASCII files whose encoding is not the locale's
> native one is precisely the case where this will fail.  Granted, it's
> a relatively rare use case, but when it does happen, all bets are off.

Which will likely have affected the user (who is foremost an Emacs user) 
already, before he/she did the search.

> So reading just a small part, as Emacs does, will yield similar
> percentage of wrong guesses.

...so that seems like a good thing.

Anyway, that should work but you don't seem to be crazy about the 
approach, and I'm not in love with the potential implementation. So 
maybe we should stop and let it brew for a little while.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 18:51                         ` Eli Zaretskii
  2020-11-29 19:07                           ` Dmitry Gutov
@ 2020-11-29 19:49                           ` Gregory Heytings via Emacs development discussions.
  1 sibling, 0 replies; 35+ messages in thread
From: Gregory Heytings via Emacs development discussions. @ 2020-11-29 19:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Dmitry Gutov, stephen.berman, emacs-devel


>>> Then I think injecting LC_ALL=C into the environment when running Grep 
>>> in this case makes the results more useful?  And we can then avoid 
>>> using -a?
>>
>> I'm not so sure. LC_ALL=C seems more problematic than -a:
>>
>> $ grep ф test.txt
>> фыва
>> $ grep -a ф test.txt
>> фыва
>> $ LC_ALL=C grep ф test.txt
>> (nothing)
>
> I guess this regression in Grep happened when they "internationalized" 
> the DFA code, sigh...
>

FWIW, I "bisected" this with various versions of grep, and this regression 
happened in 2014, between versions 2.20 and 2.21:

echo -ne "premi\xE8re\n" > latin1.txt
echo -ne "premi\xC3\xA8re\n" > utf8.txt
echo -ne "premi\xE8re\npremi\xC3\xA8re\n" > both.txt

With 2.20 with rxvt (which is clever enough to display UTF-8 and Latin-1 at the same time):
$ grep prem *.txt
both.txt:première
both.txt:première
latin1.txt:première
utf8.txt:première

With 2.20 with M-x shell (the \350 is a single character):
both.txt:premi\350re
both.txt:première
latin1.txt:premi\350re
utf8.txt:première

With 2.21, with rxvt or M-x shell:
grep prem *.txt
Binary file both.txt matches
Binary file latin1.txt matches
utf8.txt:première

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 19:07                           ` Dmitry Gutov
  2020-11-29 19:32                             ` Eli Zaretskii
@ 2020-11-29 19:49                             ` Stephen Berman
  1 sibling, 0 replies; 35+ messages in thread
From: Stephen Berman @ 2020-11-29 19:49 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Eli Zaretskii, emacs-devel

On Sun, 29 Nov 2020 21:07:49 +0200 Dmitry Gutov <dgutov@yandex.ru> wrote:

> On 29.11.2020 20:51, Eli Zaretskii wrote:
>>> I'm not so sure. LC_ALL=C seems more problematic than -a:
>>>
>>> $ grep ф test.txt
>>> фыва
>>> $ grep -a ф test.txt
>>> фыва
>>> $ LC_ALL=C grep ф test.txt
>>> (nothing)
>> I guess this regression in Grep happened when they "internationalized"
>> the DFA code, sigh...
>
> Sorry, I double-checked, and it seems to have been caused by my terminal
> emulator too: if I set LC_ALL in Emacs and do a search through shell-command
> or dired-do-find-regexp, it succeeds.
>
> You might want to verify this yourself, though.

FWIW, here with both xterm and xfce4-terminal `LC_ALL=C grep ф test.txt'
returns `фыва'.  My locale is en_US.UTF-8.

Steve Berman



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-29 19:37             ` Juri Linkov
@ 2020-11-30  1:08               ` Dmitry Gutov
  2020-11-30 20:54                 ` Juri Linkov
  0 siblings, 1 reply; 35+ messages in thread
From: Dmitry Gutov @ 2020-11-30  1:08 UTC (permalink / raw)
  To: Juri Linkov; +Cc: Eli Zaretskii, stephen.berman, emacs-devel

On 29.11.2020 21:37, Juri Linkov wrote:

>> Do we want to search the "binary" files at all? Right now we simply filter
>> such matches out (see the definition of xref-matches-in-files), and I have
>> seen no complaints.
> 
> There are two cases: a really binary file, and a legit ascii file
> with an occasional ^@ char.  And grep can't distinguish one from another.
> There is an option --binary-files=binary, but unfortunately it doesn't help,
> it still outputs "Binary file matches".

Makes sense.

> So xref parser needs to be smart enough to detect whether the matched line
> contains binary garbage when '-a' is used, or it's purely ascii.

I guess we can do that, but then some people might be a bit unhappy 
about not being able to search inside such files? It could be useful on 
occasion, too (TBC below *).

> Moreover, I think we should apply the same heuristics to the grep output
> in grep.el and add '-a' to the grep command by default.

I guess we should. Or do the LC_ALL thing. I'm still unclear on the 
difference in effect between the two.

> Then grep.el
> should prettify the lines with real binary garbage e.g. by hiding groups of
> bytes between 0 and 32, or adding a 'display' property with ellipsis.

Why not. xref could also do something like that.

>> Our interpreter is our regexp with which we parse. But I suppose as long as
>> Grep doesn't insert unexpected newlines, the parser will be fine.
> 
> For grep output a bigger problem is that grep on binary data
> might output too long lines before the terminating newline.

(*) We already have this kind of problem with "normal" files which 
contain minified assets (JS or CSS). The file contents are usually 
normal ASCII, but it's just one line which can reach several MBs in length.

The usual way to deal with that is with project-ignores and 
grep-find-ignored-files. That works for both cases.

>>> I actually don't think I understand why we need -a in this case, since
>>> Grep looks for null bytes to decide this is a binary file, and encoded
>>> non-ASCII characters don't have null bytes (except if they are in
>>> UTF-16).
>>
>> Good question.
> 
> The grep manual says that binary data are either output bytes that
> are improperly encoded for the current locale, or null input bytes.

So... if we add LC_ALL=C but not '-a' we will allow the "improperly 
encoded" case but not the "null input bytes" one?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-30  1:08               ` Dmitry Gutov
@ 2020-11-30 20:54                 ` Juri Linkov
  2020-12-01  0:34                   ` Dmitry Gutov
  0 siblings, 1 reply; 35+ messages in thread
From: Juri Linkov @ 2020-11-30 20:54 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Eli Zaretskii, stephen.berman, emacs-devel

>> For grep output a bigger problem is that grep on binary data
>> might output too long lines before the terminating newline.
>
> (*) We already have this kind of problem with "normal" files which contain
> minified assets (JS or CSS). The file contents are usually normal ASCII,
> but it's just one line which can reach several MBs in length.
>
> The usual way to deal with that is with project-ignores and
> grep-find-ignored-files. That works for both cases.

This is a big problem - often grep output lines are so long
that Emacs freezes, so I need to kill the process.  Manually
updating ignored-files every time a new file causes a freeze
is a very unreliable and time-consuming workaround.

I tried to fix this problem, and fortunately the fix is simple
with the 1-liner patch.

It does exactly the same thing that we recently did to hide
overly long grep command lines with 'grep-find-abbreviate'.
The patch even uses the same 'grep-find-abbreviate-properties'
to allow clicking the hidden part to expand it.

diff --git a/lisp/progmodes/grep.el b/lisp/progmodes/grep.el
index dafba22f77..e0df2402ee 100644
--- a/lisp/progmodes/grep.el
+++ b/lisp/progmodes/grep.el
@@ -492,6 +492,9 @@ grep-mode-font-lock-keywords
       (0 grep-context-face)
       (1 (if (eq (char-after (match-beginning 1)) ?\0)
              `(face nil display ,(match-string 2)))))
+     ;; Hide excessive parts of grep output lines
+     ("^.+?:.\\{,64\\}\\(.*\\).\\{10\\}$"
+      1 grep-find-abbreviate-properties)
      ;; Hide excessive part of rgrep command
      ("^find \\(\\. -type d .*\\\\)\\)"
       (1 (if grep-find-abbreviate grep-find-abbreviate-properties

More customizability could be added later to define the
length of the hidden part, etc.
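
To make the effect of the regexp concrete: it keeps everything up to the
first colon plus at most 64 characters, hides the middle, and keeps the
last 10 characters. A rough shell approximation of the same cut, for
illustration only; the function name and the "[...]" marker are
hypothetical, and the patch itself hides the text with text properties
rather than deleting it:

```shell
# Hypothetical filter: abbreviate overly long "file:text" lines on stdin.
abbreviate_lines() {
    awk -v keep_head=64 -v keep_tail=10 '{
        i = index($0, ":")                 # end of the "file:" prefix
        if (i < 2) { print; next }         # no usable prefix: leave as is
        prefix = substr($0, 1, i)
        rest = substr($0, i + 1)
        n = length(rest)
        # Short enough: nothing would be hidden, print unchanged.
        if (n <= keep_head + keep_tail) { print; next }
        # Keep the head and the tail, replace the middle with a marker.
        print prefix substr(rest, 1, keep_head) "[...]" \
              substr(rest, n - keep_tail + 1)
    }'
}
```

For example, `grep -n foo *.js | abbreviate_lines` would shorten lines
from minified files while leaving ordinary matches untouched. Note that
the line number printed by grep counts toward the 64 kept characters,
as it does in the regexp.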



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: dired-do-find-regexp failure with latin-1 encoding
  2020-11-30 20:54                 ` Juri Linkov
@ 2020-12-01  0:34                   ` Dmitry Gutov
  0 siblings, 0 replies; 35+ messages in thread
From: Dmitry Gutov @ 2020-12-01  0:34 UTC (permalink / raw)
  To: Juri Linkov; +Cc: Eli Zaretskii, stephen.berman, emacs-devel

On 30.11.2020 22:54, Juri Linkov wrote:
>>> For grep output a bigger problem is that grep on binary data
>>> might output too long lines before the terminating newline.
>>
>> (*) We already have this kind of problem with "normal" files which contain
>> minified assets (JS or CSS). The file contents are usually normal ASCII,
>> but it's just one line which can reach several MBs in length.
>>
>> The usual way to deal with that is with project-ignores and
>> grep-find-ignored-files. That works for both cases.
> 
> This is a big problem - often grep output lines are so long
> that Emacs freezes, so I need to kill the process.  Manually
> updating ignored-files every time a new file causes a freeze
> is a very unreliable and time-consuming workaround.

And a non-obvious one (for an average user).

Is the same problem exhibited by commands using the Xref UI? I don't 
remember seeing it, but of course our projects can be very different.

> I tried to fix this problem, and fortunately the fix is simple
> with the 1-liner patch.
> 
> It does exactly the same thing that we recently did to hide
> overly long grep command lines with 'grep-find-abbreviate'.
> The patch even uses the same 'grep-find-abbreviate-properties'
> to allow clicking the hidden part to expand it.
> 
> diff --git a/lisp/progmodes/grep.el b/lisp/progmodes/grep.el
> index dafba22f77..e0df2402ee 100644
> --- a/lisp/progmodes/grep.el
> +++ b/lisp/progmodes/grep.el
> @@ -492,6 +492,9 @@ grep-mode-font-lock-keywords
>         (0 grep-context-face)
>         (1 (if (eq (char-after (match-beginning 1)) ?\0)
>                `(face nil display ,(match-string 2)))))
> +     ;; Hide excessive parts of grep output lines
> +     ("^.+?:.\\{,64\\}\\(.*\\).\\{10\\}$"
> +      1 grep-find-abbreviate-properties)
>        ;; Hide excessive part of rgrep command
>        ("^find \\(\\. -type d .*\\\\)\\)"
>         (1 (if grep-find-abbreviate grep-find-abbreviate-properties

Looks sensible to me, but perhaps you want to create a new 
discussion/bug-number for it? Unless you'd like to follow up with a 
patch for xref.el (if the problem applies there).

> More customizability could be added later to define the
> length of the hidden part, etc.

Maybe we'll want it to be dynamically determined by fill-column.

Or just be a big enough value (e.g. 256) that the only lines where this 
rule is hit are obviously too long.



^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-12-01  0:34 UTC | newest]

Thread overview: 35+ messages
2020-11-28 18:03 dired-do-find-regexp failure with latin-1 encoding Stephen Berman
2020-11-28 18:11 ` Eli Zaretskii
2020-11-28 18:46   ` Stephen Berman
2020-11-28 19:13     ` Eli Zaretskii
2020-11-28 19:44       ` Stephen Berman
2020-11-28 19:49         ` Eli Zaretskii
2020-11-28 20:16       ` Dmitry Gutov
2020-11-28 20:29         ` Eli Zaretskii
2020-11-28 21:04           ` Dmitry Gutov
2020-11-29  0:49             ` Dmitry Gutov
2020-11-29 15:19               ` Eli Zaretskii
2020-11-29 16:27                 ` Dmitry Gutov
2020-11-29 17:18                   ` Eli Zaretskii
2020-11-29 17:32                     ` Dmitry Gutov
2020-11-29 18:42                       ` Eli Zaretskii
2020-11-29 19:48                         ` Dmitry Gutov
2020-11-29 15:06             ` Eli Zaretskii
2020-11-29 15:14               ` Yuri Khan
2020-11-29 15:36                 ` Stephen Berman
2020-11-29 15:50                 ` Eli Zaretskii
2020-11-29 16:07               ` Dmitry Gutov
2020-11-29 17:12                 ` Eli Zaretskii
2020-11-29 17:19                   ` Dmitry Gutov
2020-11-29 17:25                     ` Eli Zaretskii
2020-11-29 17:44                       ` Dmitry Gutov
2020-11-29 18:51                         ` Eli Zaretskii
2020-11-29 19:07                           ` Dmitry Gutov
2020-11-29 19:32                             ` Eli Zaretskii
2020-11-29 19:34                               ` Eli Zaretskii
2020-11-29 19:49                             ` Stephen Berman
2020-11-29 19:49                           ` Gregory Heytings via Emacs development discussions.
2020-11-29 19:37             ` Juri Linkov
2020-11-30  1:08               ` Dmitry Gutov
2020-11-30 20:54                 ` Juri Linkov
2020-12-01  0:34                   ` Dmitry Gutov
