Text copied from *grep* buffer has NUL (0x00) characters

unofficial mirror of help-gnu-emacs@gnu.org

* Text copied from *grep* buffer has NUL (0x00) characters
From: R. Diez @ 2021-05-09  9:19 UTC
  To: help-gnu-emacs

Hi all:

I have been using the utf-8-with-signature-dos encoding for years with my main notes.txt file, because it is very portable. Even ancient versions of 
Windows Notepad honour the UTF-8 BOM correctly.

Recently, my notes.txt became corrupt a few times. I started seeing ^M characters at the end of each line, and other text editors started complaining 
about invalid UTF-8 sequences inside.

I thought my network connection was unreliable, or maybe my local disk, or Emacs had a bug. Restoring the notes.txt file wasn't easy, because it was 
not obvious what was wrong with it. I couldn't find a command-line tool that would easily replace any invalid UTF-8 sequences with their hex code 
equivalents, but I must admit that I did not actually invest much time looking. After all, I have automated backups.

Yesterday, I remembered exactly what I had done last: I had copied text from the *grep* buffer after using 'rgrep'.

After some investigation, it turns out Emacs' default "Grep Command" is "grep --color -nH --null -e ", which includes option "--null". This means that 
grep is embedding an ASCII NUL character (a binary 0x00) after the filenames.

This is what an rgrep text search occurrence looks like in the *grep* buffer:

./some/file.txt:123:some text line

The first ':' is actually a binary null, but the *grep* buffer hides this fact.
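
The separator can also be inspected outside Emacs. A minimal sketch, assuming GNU grep (for --null) and a POSIX od; the temporary directory and the file name are made up for the demonstration:

```shell
# Create a throwaway file to search (hypothetical name and content).
tmpdir=$(mktemp -d)
printf 'some text line\n' > "$tmpdir/file.txt"

# With --null, grep writes a NUL byte after the file name; od -c shows it as \0.
with_null=$(grep -nH --null -e text "$tmpdir/file.txt" | od -An -c)

# Without --null, the file name is followed by an ordinary ':'.
without_null=$(grep -nH -e text "$tmpdir/file.txt" | od -An -c)

printf 'with --null:\n%s\n' "$with_null"
printf 'without --null:\n%s\n' "$without_null"

# Remove the throwaway directory again.
rm -rf "$tmpdir"
```

The --color option is left out here on purpose, so that no escape sequences clutter the dump.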

If you copy that text line to an Emacs text file buffer, it then looks like this:

./some/file.txt^@123:some text line

The ^@ is the representation for the binary null. With my preliminary testing, I could not reproduce the kind of text file "corruption" I had seen 
before, but other text editors started complaining again about an invalid UTF-8 sequence or the like.

For example, Pluma, the MATE Desktop text editor, complained about an "incomplete multibyte sequence in input". Pluma refuses to open short files with 
embedded NUL characters, either because it cannot detect the character encoding or because it decides that the file looks binary. The merge tool Meld 
also complained about invalid characters.
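
Before handing such a file to other tools, it can be checked from a shell. The following is only a sketch, assuming POSIX tr and wc plus an iconv that rejects invalid input (as the common glibc and libiconv versions do); notes-sample.txt is a made-up test file, not the real notes.txt:

```shell
# Build a small sample with two grep-style lines, each holding one NUL byte
# (hypothetical content; printf reuses the format for each pair of arguments).
tmpdir=$(mktemp -d)
sample="$tmpdir/notes-sample.txt"
printf '%s\000%s\n' './some/file.txt' '123:some text line' \
                    './other/file.txt' '7:another line' > "$sample"

# Count NUL bytes: delete everything except NUL, then count what is left.
nul_count=$(tr -cd '\000' < "$sample" | wc -c | tr -d ' ')
echo "NUL bytes found: $nul_count"

# Check UTF-8 validity: iconv normally exits non-zero on an invalid sequence.
# NUL is valid UTF-8, so this check alone does not flag the sample.
if iconv -f UTF-8 -t UTF-8 "$sample" > /dev/null 2>&1; then
  utf8_status=valid
else
  utf8_status=invalid
fi
echo "UTF-8 check: $utf8_status"

# Write a cleaned copy in which every NUL byte becomes ':'.
tr '\000' ':' < "$sample" > "$tmpdir/notes-clean.txt"
clean_count=$(tr -cd '\000' < "$tmpdir/notes-clean.txt" | wc -c | tr -d ' ')
echo "NUL bytes after cleaning: $clean_count"
```

Replacing NUL with ':' only makes sense for text pasted from grep output; NUL bytes from other sources may need a different replacement or plain deletion (tr -d '\000').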

I would say that Emacs has 2 issues here:

1) If a text file encoding is utf-8-with-signature-dos, I do not think that it is a good idea for Emacs to allow binary zeros without any warning.

A character sequence like ^@ is easy to miss in the middle of long text lines, as it is not coloured in red and does not have any other visible hint.

A 0x00 may well be a valid UTF-8 character, but it is probably going to cause problems in many places. This kind of problem is not new, see also 
"modified UTF-8".

I think that I have seen warnings from Emacs before about characters that could not be encoded in the current buffer encoding. I would welcome such a 
warning for binary zeros.

2) Copying text from a *grep* buffer that looks like ":" should not suddenly deliver a NUL character instead. That's just unexpected and prone to 
problems down the line.

Regards,
   rdiez

* Re: Text copied from *grep* buffer has NUL (0x00) characters
From: Eli Zaretskii @ 2021-05-09 10:01 UTC
  To: help-gnu-emacs

> From: "R. Diez" <rdiezmail-emacs@yahoo.de>
> Date: Sun, 9 May 2021 11:19:38 +0200
> 
> I would say that Emacs has 2 issues here:
> 
> 
> 1) If a text file encoding is utf-8-with-signature-dos, I do not think that it is a good idea for Emacs to allow binary zeros without any warning.

There's nothing wrong with null bytes in a UTF-8 encoded file, not in
general.  We could have an optional warning about null bytes (when?
when you save the buffer?).  But I see no reason to do that by
default, especially since such a feature would require a costly search
of the entire buffer.

> 2) Copying text from a *grep* buffer that looks like ":" should not suddenly deliver a NUL character instead. That's just unexpected and prone to 
> problems down the line.

This is easy to fix: customize the Grep command to not include
"--null".  That switch is mainly for systems that allow newlines in
file names, which MS-Windows doesn't allow, so if this switch causes
trouble in your usage, simply remove it.

* Re: Text copied from *grep* buffer has NUL (0x00) characters
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2021-05-09 13:57 UTC
  To: help-gnu-emacs

> 2) Copying text from a *grep* buffer that looks like ":" should not suddenly
>    deliver a NUL character instead. That's just unexpected and prone to
>    problems down the line.

This "what you see is NOT what you get" is indeed undesirable.  I'm not
sure it's easy to fix in a reliable way in Emacs (besides not using
`--null` as Eli points out), but I suggest you `M-x report-emacs-bug`.
Maybe grep-mode can add a `filter-buffer-substring-function` that
converts those NULs into `:`.

For the detection of NULs in UTF-8 files, you could also ask for such
a feature via `M-x report-emacs-bug` but it should be pretty easy to get
something comparable with something like:

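    (defun my-utf-8-nul-check ()
      "Signal an error if the current buffer contains a NUL character."
      (save-excursion
        (goto-char (point-min))
        ;; "\000" is the octal escape for the NUL character.
        (when (search-forward "\000" nil t)
          (error "NUL!!"))))
    ;; Run the check before every save; the error aborts the save.
    (add-hook 'before-save-hook #'my-utf-8-nul-check)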

-- Stefan

* Re: Text copied from *grep* buffer has NUL (0x00) characters
From: R. Diez @ 2021-05-09 18:47 UTC
  To: Eli Zaretskii, monnier; +Cc: help-gnu-emacs

> There's nothing wrong with null bytes in a UTF-8 encoded file, not in
> general.

Well, that's true by the book.

I already mentioned Meld and Pluma. Xfce's text editor, Mousepad, also refuses to open UTF-8 files with BOM if they contain a NUL character.

gedit at least used to have the same problem:
https://superuser.com/questions/246014/use-gedit-to-open-file-with-null-characters

Geany truncates the file at the first NUL.

So it is a problem in practice.

But we could of course insist on everyone switching to a proper text editor when they try to open our UTF-8 files with embedded NULs. That will surely 
make us even more popular... ]8-)

> We could have an optional warning about null bytes (when?
> when you save the buffer?).  But I see no reason to do that by
> default, especially since such a feature would require a costly search
> of the entire buffer.

Some terminal emulators warn when pasting suspicious text.

Emacs is already checking all bytes on save. I inserted an invalid sequence and got this warning on save:

-------------8<-------------8<-------------
These default coding systems were tried to encode text
in the buffer ‘Test3.txt’:
   (utf-8-with-signature-dos (11 . 4194176) (12 . 4194239))
However, each of them encountered characters it couldn’t encode:
   utf-8-with-signature-dos cannot encode these: \200 \277

Click on a character (or switch to this window by ‘C-x o’
and select the characters by RET) to jump to the place it appears,
where ‘M-x universal-argument C-x =’ will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
    to remove or modify the problematic characters,
or specify any other coding system (and risk losing
    the problematic characters).

   raw-text no-conversion
-------------8<-------------8<-------------

Therefore, I don't think it would cost too much to check for NULs at the same time, and give users the choice.

> This is easy to fix: customize the Grep command to not include
> "--null".  That switch is mainly for systems that allow newlines in
> file names, which MS-Windows doesn't allow, so if this switch causes
> trouble in your usage, simply remove it.

I am using Linux. Of course, now that I know what the issue is, I can just remove --null from the grep command and be done with it. That would quietly 
fix the problem for me.

The reason I wrote a long e-mail is to illustrate my head scratching when I got hit several times, because it is not obvious where the problem is 
coming from.

I'll post again if I manage to reproduce a more serious variant of this issue where the file started to show Chinese characters in other editors, 
while Emacs decided to start showing ^M at the end of the lines. My guess is that it was a similar gotcha, because I have been copying from the *grep* 
buffer a few times in the last days.

I believe that this NUL gotcha is going to hit many people, who will then think "this is just another Emacs quirk". After all, grep's --null is a 
relatively recent change, made in Emacs 26.1. And many log files have embedded NUL characters too, so you may inadvertently copy NUL characters along.

 > For the detection of NULs in UTF-8 files, you could also ask for such
 > a feature via `M-x report-emacs-bug` but it should be pretty easy to get
 > something comparable with something like:
 > [...]

I don't think it is desirable for users to install such Lisp hooks to deal with such corner cases. My opinion is that Emacs should be more helpful 
here by default. But maybe this mailing list post is enough, if users facing such "corruption" or character encoding problems manage to enter the 
right search terms.

 > This "what you see is NOT what you get" is indeed undesirable.  I'm not
 > sure it's easy to fix in a reliable way in Emacs (besides not using
 > `--null` as Eli points out), but I suggest you `M-x report-emacs-bug`.
 > Maybe grep-mode can add a `filter-buffer-substring-function` that
 > converts those NULs into `:`.

That seems fair. I'll report that as a bug.

Regards,
   rdiez

* Re: Text copied from *grep* buffer has NUL (0x00) characters
From: Eli Zaretskii @ 2021-05-09 18:57 UTC
  To: help-gnu-emacs

> From: "R. Diez" <rdiezmail-emacs@yahoo.de>
> Cc: help-gnu-emacs@gnu.org
> Date: Sun, 9 May 2021 20:47:28 +0200
> 
> Emacs is already checking all bytes on save. I inserted an invalid sequence and got this warning on save:

That's not the same.  The warning you saw is triggered by a failure to
convert to the external encoding, so it consumes no extra CPU cycles.
Null bytes will not fail anything, so you would have to test for them
explicitly (and in some encodings, like UTF-16, they are necessary and
cannot be avoided).

* Re: Text copied from *grep* buffer has NUL (0x00) characters
From: Stefan Monnier @ 2021-05-09 19:47 UTC
  To: R. Diez; +Cc: Eli Zaretskii, help-gnu-emacs

>> For the detection of NULs in UTF-8 files, you could also ask for such
>> a feature via `M-x report-emacs-bug` but it should be pretty easy to get
>> something comparable with something like:
>> [...]
>
> I don't think it is desirable for users to install such Lisp hooks to deal
> with such corner cases.

There's a tension between avoiding pitfalls and making it inconvenient
for corner cases.

I do think there's a real plain bug here, tho, if you change your
"recipe" to `utf-8` instead of `utf-8-with-signature`: take a utf-8 text
file (in a UTF-8 locale), add a NUL byte to it, save, close, and
re-open: you now get a unibyte buffer showing the bytes rather than
the chars.

Emacs should generally try and warn you when saving a file with a coding
system different than the one it would guess when later re-opening the file.
The problem doesn't show up with `utf-8-with-signature` because
apparently the BOM is given more weight than the NUL byte in determining
which coding system to use.

> My opinion is that Emacs should be more helpful here by default.

No point arguing here then: make it a bug report.

`help-gnu-emacs` is rather for the case where you're looking for
a workaround (or when you suspect what you're seeing is a "feature" you
just fail to understand) rather than fixing it "for everyone".

        Stefan

* Re: Text copied from *grep* buffer has NUL (0x00) characters
From: R. Diez @ 2021-05-09 21:13 UTC
  To: Eli Zaretskii, Stefan Monnier; +Cc: help-gnu-emacs

EZ> That's not the same.  The warning you saw is triggered by a failure to
EZ> convert to the external encoding, so it consumes no extra CPU cycles.

But it could be, from my (admittedly naive) point of view:

(convert-to-external-encoding  but-with-some-extra-flag-to-warn-about-NUL-chars)

EZ> Null bytes will not fail anything, so you should test for them
EZ> explicitly (and in some encodings, like UTF-16, they are necessary and
EZ> cannot be avoided).

I didn't know that about UTF-16, but I could not find any information about it either. Why is a NUL char necessary in UTF-16 and not UTF-8?

Or do you mean that UTF-16 tends to have many interleaved zero bytes? In this case, I would have thought that the problem would be the 16-bit NUL 
character, I mean 0x0000. That is the character to watch out for in UTF-16.

Encodings like UTF-16, which always need more than one byte per character, are uncommon, won't work with many text editors or tools like 'grep', and 
most people will expect problems with them anyway. So I wouldn't worry too much about them.

The NUL char issue (the unexpected problems I talked about), that you are likely to run into sooner or later, will probably only affect the popular, 
single-byte-oriented formats like ASCII, ISO/IEC 8859-1 and UTF-8.

SM> I do think there's a real plain bug here, tho, if you change your
SM> "recipe" to `utf-8` instead of `utf-8-with-signature`: take a utf-8 text
SM> file (in a UTF-8 locale), add a NUL byte to it, save, close, and
SM> re-open: you now get a unibyte buffer showing the bytes rather than
SM> the chars.
SM>
SM> Emacs should generally try and warn you when saving a file with a coding
SM> system different than the one it would guess when later re-opening the file.
SM> The problem doesn't show up with `utf-8-with-signature` because
SM> apparently the BOM is given more weight than the NUL byte in determining
SM> which coding system to use.

Thanks for pointing that out.

That is why I think that NUL may be a valid character, perfectly fine in theory, but it easily trips up even Emacs itself. This is why I would make 
Emacs smarter and warn about it, either on paste, or on save.

There may be one more quirk in this area, because my text file had somehow lost the UTF-8 BOM too, and I only edit it with Emacs.

I cannot invest more time into this issue at the moment. I hope these posts provide enough information if somebody is interested in the future.

Regards,
   rdiez

* Re: Text copied from *grep* buffer has NUL (0x00) characters
From: tomas @ 2021-05-10  7:10 UTC
  To: R. Diez; +Cc: help-gnu-emacs, Stefan Monnier

On Sun, May 09, 2021 at 11:13:36PM +0200, R. Diez wrote:
> 
> EZ> That's not the same.  The warning you saw is triggered by a failure to
> EZ> convert to the external encoding, so it consumes no extra CPU cycles.
> 
> But it could be, from my (admittedly naive) point of view:
> 
> (convert-to-external-encoding  but-with-some-extra-flag-to-warn-about-NUL-chars)
> 
> 
> EZ> Null bytes will not fail anything, so you should test for them
> EZ> explicitly (and in some encodings, like UTF-16, they are necessary and
> EZ> cannot be avoided).
> 
> I didn't know that about UTF-16, but I could not find any information about it either. Why is a NUL char necessary in UTF-16 and not UTF-8?

UTF-16 [1] encodes characters using 16 bit "packets" called "code
units". Like UTF-8, whenever one unit isn't sufficient, you use
more. The bit pattern tells you whether there are more to come.

In the case of UTF-16 "more" is at most two.

For the "small" code points, 8 of those 16 bits are zero. Which ones
depends on endianness, but either way, you end up with a lot
of zero bytes in your text. This is what UTF-16BE (big endian) looks
like:

  tomas@trotzki:~$ echo "hello, world" | iconv -f utf-8 -t UTF-16BE | hexdump -C
  00000000  00 68 00 65 00 6c 00 6c  00 6f 00 2c 00 20 00 77  |.h.e.l.l.o.,. .w|
  00000010  00 6f 00 72 00 6c 00 64  00 0a                    |.o.r.l.d..|
  0000001a

... so a bit like Swiss cheese.
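
The NUL character itself (U+0000) then becomes a pair of zero bytes, which is the 0x0000 code unit asked about above. A sketch along the same lines, assuming the same iconv plus a POSIX od:

```shell
# 'a', the NUL character (U+0000), then 'b', dumped as raw UTF-8 bytes.
utf8_hex=$(printf 'a\000b' | od -An -tx1 | tr -d ' \n')

# The same three characters converted to big-endian UTF-16.
utf16_hex=$(printf 'a\000b' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')

echo "UTF-8:    $utf8_hex"    # one byte per character here
echo "UTF-16BE: $utf16_hex"   # two bytes per character; NUL becomes 00 00
```

So a byte-wise search for 00 says nothing useful about UTF-16 text, while in UTF-8 a 00 byte can only be the NUL character.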

UTF-16 needs a BOM (byte order mark) to disambiguate on endianness.
UTF-8 doesn't (is a byte stream), although Microsoftey-applications
tend to sneak one in, just to annoy the rest of us.

Or something.

Cheers
 - t
