Re: Text copied from *grep* buffer has NUL (0x00) characters


From: "R. Diez" <rdiezmail-emacs@yahoo.de>
To: Eli Zaretskii <eliz@gnu.org>, monnier@iro.umontreal.ca
Cc: help-gnu-emacs@gnu.org
Subject: Re: Text copied from *grep* buffer has NUL (0x00) characters
Date: Sun, 9 May 2021 20:47:28 +0200
Message-ID: <3e892a2e-1d04-7712-d129-e4f59382457b@yahoo.de>
In-Reply-To: <83bl9k8buk.fsf@gnu.org>

> There's nothing wrong with null bytes in a UTF-8 encoded file, not in
> general.

Well, that's true by the book.

I already mentioned Meld and Pluma. Xfce's text editor, Mousepad, also refuses to open UTF-8 files with a BOM if they contain a NUL character.

gedit at least used to have the same problem:
https://superuser.com/questions/246014/use-gedit-to-open-file-with-null-characters

Geany truncates the file at the first NUL.

So it is a problem in practice.

But we could of course insist on everyone switching to a proper text editor when they try to open our UTF-8 files with embedded NULs. That will surely make us even more popular... ]8-)

> We could have an optional warning about null bytes (when?
> when you save the buffer?).  But I see no reason to do that by
> default, especially since such a feature would require a costly search
> of the entire buffer.

Some terminal emulators warn when pasting suspicious text.

Emacs is already checking all bytes on save. I inserted an invalid sequence and got this warning on save:

-------------8<-------------8<-------------
These default coding systems were tried to encode text
in the buffer ‘Test3.txt’:
   (utf-8-with-signature-dos (11 . 4194176) (12 . 4194239))
However, each of them encountered characters it couldn’t encode:
   utf-8-with-signature-dos cannot encode these: \200 \277

Click on a character (or switch to this window by ‘C-x o’
and select the characters by RET) to jump to the place it appears,
where ‘M-x universal-argument C-x =’ will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
    to remove or modify the problematic characters,
or specify any other coding system (and risk losing
    the problematic characters).

   raw-text no-conversion
-------------8<-------------8<-------------

Therefore, I don't think it would cost too much to check for NULs at the same time, and give users the choice.
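
To illustrate how simple such a check is, here is a minimal Python sketch. It is not Emacs code and says nothing about how Emacs implements its encoding check; the helper names find_nul_offsets and check_for_save and the sample bytes are made up for this example. It validates a byte string as UTF-8 and reports the offsets of any NUL bytes. Note that the NUL passes UTF-8 validation, which is why it would need a separate test.

```python
from __future__ import annotations


def find_nul_offsets(data: bytes) -> list[int]:
    """Return the byte offsets of all NUL (0x00) bytes in data."""
    offsets = []
    pos = data.find(b"\x00")
    while pos != -1:
        offsets.append(pos)
        pos = data.find(b"\x00", pos + 1)
    return offsets


def check_for_save(data: bytes) -> tuple[bool, list[int]]:
    """Return (is_valid_utf8, nul_offsets) for the bytes about to be written."""
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return valid, find_nul_offsets(data)


# Hypothetical sample: UTF-8 BOM, then text with one embedded NUL and a DOS line ending.
sample = b"\xef\xbb\xbffile.txt\x0012:some match\r\n"
print(check_for_save(sample))  # → (True, [11])
```

An editor could use the second element of the result to warn the user and offer to jump to the first offending position, much like the existing warning above does for characters that cannot be encoded.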

> This is easy to fix: customize the Grep command to not include
> "--null".  That switch is mainly for systems that allow newlines in
> file names, which MS-Windows doesn't allow, so if this switch causes
> trouble in your usage, simply remove it.

I am using Linux. Of course, now that I know what the issue is, I can just remove --null from the grep command and be done with it. That would quietly fix the problem for me.

The reason I wrote such a long e-mail is to illustrate how much head scratching it cost me when I got hit several times, because it is not obvious where the problem comes from.

I'll post again if I manage to reproduce a more serious variant of this issue, where the file started to show Chinese characters in other editors, while Emacs decided to start showing ^M at the end of the lines. My guess is that it was a similar gotcha, because I have been copying from the *grep* buffer a few times in the last few days.

I believe that this NUL gotcha is going to hit many people, who will then think "this is just another Emacs quirk". After all, the grep --null option is a relatively recent change in Emacs 26.1. And many log files have embedded NUL characters too, so you may inadvertently copy NUL characters along.

> For the detection of NULs in UTF-8 files, you could also ask for such
> a feature via `M-x report-emacs-bug` but it should be pretty easy to get
> something comparable with something like:
> [...]

I don't think it is desirable for users to install such Lisp hooks to deal with such corner cases. My opinion is that Emacs should be more helpful here by default. But maybe this mailing list post is enough, if users facing such "corruption" or character encoding problems manage to enter the right search terms.

> This "what you see is NOT what you get" is indeed undesirable.  I'm not
> sure it's easy to fix in a reliable way in Emacs (beside not using
> `--null` as Eli points out), but I suggest you `M-x report-emacs-bug`.
> Maybe grep-mode can add a `filter-buffer-substring-function` that
> converts those NUL into `:`.

That seems fair. I'll report that as a bug.
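
For the bug report, the idea can be sketched outside Emacs. The following Python snippet is only an illustration under my own assumptions: the sample line and the function name nul_to_colon are made up, and this is not how grep-mode or `filter-buffer-substring-function` works internally. With --null, grep writes a NUL byte after the file name instead of the usual ':', so replacing that byte when text is copied restores the conventional file:line:text form.

```python
def nul_to_colon(copied: str) -> str:
    """Replace NUL separators (as written by grep --null) with ':'."""
    return copied.replace("\x00", ":")


# Hypothetical line as produced by `grep -nH --null`:
# file name, NUL byte, line number, ':', matched text.
raw = "src/main.c\x0042:int main(void)"
print(repr(nul_to_colon(raw)))  # → 'src/main.c:42:int main(void)'
```

This blunt version would also rewrite a NUL that is part of the matched text itself, for example when grepping a log file with embedded NULs, so a real fix would probably need to touch only the separator after the file name.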

Regards,
   rdiez
