From: "R. Diez" <rdiezmail-emacs@yahoo.de>
To: help-gnu-emacs@gnu.org
Subject: Text copied from *grep* buffer has NUL (0x00) characters
Date: Sun, 9 May 2021 11:19:38 +0200
Message-ID: <ddcebd62-1eb7-d24c-9a85-dadeb62c6ea2@yahoo.de>
In-Reply-To: ddcebd62-1eb7-d24c-9a85-dadeb62c6ea2.ref@yahoo.de

Hi all:

I have been using the utf-8-with-signature-dos encoding for years with my main notes.txt file, because it is very portable. Even ancient versions of
Windows Notepad honour the UTF-8 BOM correctly.

Recently, my notes.txt became corrupted a few times. I started seeing ^M characters at the end of each line, and other text editors started complaining
about invalid UTF-8 sequences inside.

I thought my network connection was unreliable, or maybe my local disk, or Emacs had a bug. Restoring the notes.txt file wasn't easy, because it was 
not obvious what was wrong with it. I couldn't find a command-line tool that would easily replace any invalid UTF-8 sequences with their hex code 
equivalents, but I must admit that I did not actually invest much time looking. After all, I have automated backups.
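
A minimal sketch of such a tool, assuming Python 3.5 or later is available. The function name show_invalid_utf8 is made up for illustration. It relies on the standard "backslashreplace" error handler, which renders each undecodable byte as \xNN and leaves valid text alone:

```python
def show_invalid_utf8(data: bytes) -> str:
    """Decode bytes as UTF-8, rendering each undecodable byte as \\xNN."""
    # "backslashreplace" is a standard codec error handler; it has
    # supported decoding (not just encoding) since Python 3.5.
    return data.decode("utf-8", errors="backslashreplace")


# Valid UTF-8 passes through unchanged; the stray 0xFF byte becomes "\xff".
print(show_invalid_utf8(b"ok \xff end"))
```

Reading the file in binary mode and passing its contents to this function would make the broken spots searchable by looking for "\x". Note that a NUL byte is valid UTF-8, so this approach would not flag it.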

Yesterday, I remembered exactly what I had done last: I had copied text from the *grep* buffer after using 'rgrep'.

After some investigation, it turns out that Emacs' default "Grep Command" is "grep --color -nH --null -e ", which includes the option "--null". This means
that grep embeds an ASCII NUL character (a binary 0x00) after each filename.
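
To illustrate the format, here is a rough sketch of how a program could take apart one such output line. It assumes the line carries no colour escape sequences and that the filename and text are UTF-8. The function name parse_grep_null_line and the sample line are illustrative only:

```python
def parse_grep_null_line(line: bytes):
    """Split one 'grep -nH --null' output line into (filename, line_no, text)."""
    # With --null, the filename is terminated by a NUL byte instead of ':'.
    filename, sep, rest = line.partition(b"\0")
    if not sep:
        raise ValueError("no NUL separator found; was --null in effect?")
    # The line number and the matched text are still separated by ':'.
    line_no, _, text = rest.partition(b":")
    return filename.decode("utf-8"), int(line_no), text.decode("utf-8")


sample = b"./some/file.txt\x00123:some text line"
print(parse_grep_null_line(sample))
# prints ('./some/file.txt', 123, 'some text line')
```

The NUL makes the filename boundary unambiguous even when a filename itself contains a colon, which is presumably why the option is in the default command.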

This is what an rgrep text search occurrence looks like in the *grep* buffer:

./some/file.txt:123:some text line

The first ':' is actually a binary null, but the *grep* buffer hides this fact.

If you copy that text line to an Emacs text file buffer, it then looks like this:

./some/file.txt^@123:some text line

The ^@ is the representation of the binary null. In my preliminary testing, I could not reproduce the kind of text file "corruption" I had seen
before, but other text editors started complaining again about an invalid UTF-8 sequence or the like.

For example, the MATE Desktop text editor, Pluma, complained about an "incomplete multibyte sequence in input". Pluma refuses to open short files with
embedded NUL characters because it cannot detect the character encoding, or because it claims that it looks like a binary file. The merge tool Meld also
complained about invalid characters.

I would say that Emacs has two issues here:


1) If a text file encoding is utf-8-with-signature-dos, I do not think that it is a good idea for Emacs to allow binary zeros without any warning.

A character sequence like ^@ is easy to miss in the middle of long text lines, as it is not coloured in red and does not have any other visible hint.

A 0x00 may well be a valid UTF-8 character, but it is probably going to cause problems in many places. This kind of problem is not new; see also
"modified UTF-8".

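As a small illustration of the point about NUL and "modified UTF-8": standard UTF-8 encodes U+0000 as the single byte 0x00, while modified UTF-8 (as used by Java) writes it as the two bytes 0xC0 0x80 so that encoded strings never contain a zero byte. The helper names below are made up, and only the NUL rule of modified UTF-8 is sketched, not its handling of supplementary characters:

```python
def to_modified_utf8_nul(data: bytes) -> bytes:
    """Rewrite NUL bytes as 0xC0 0x80, the way modified UTF-8 encodes U+0000."""
    return data.replace(b"\x00", b"\xc0\x80")


def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly with Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


plain = "a\x00b".encode("utf-8")          # NUL stays a single 0x00 byte
print(plain, is_strict_utf8(plain))       # a strict decoder accepts it
modified = to_modified_utf8_nul(plain)
print(modified, is_strict_utf8(modified)) # the overlong form is rejected
```

So a strict decoder has no objection to the NUL itself; the trouble comes from tools that treat 0x00 as a terminator or as a sign of binary data.
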
I think that I have seen warnings from Emacs before about characters that could not be encoded in the current buffer encoding. I would welcome such a 
warning for binary zeros.
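
Outside Emacs, a check along these lines could at least locate the offending characters before they cause trouble elsewhere. This is only a sketch of the idea, not of anything Emacs does; the function name find_nul_positions is hypothetical:

```python
def find_nul_positions(text: str):
    """Return 1-based (line, column) pairs for every NUL character in text."""
    positions = []
    # splitlines() handles both LF and CRLF line endings.
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\x00":
                positions.append((line_no, col))
    return positions


sample = "first line\r\n./some/file.txt\x00123:some text line\r\n"
print(find_nul_positions(sample))
# prints [(2, 16)]
```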


2) Copying text that looks like ":" from a *grep* buffer should not suddenly deliver a NUL character instead. That's just unexpected and prone to
problems down the line.
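
As a stopgap for text that has already been pasted, something like the following could turn the hidden separators back into what the *grep* buffer displayed. It is a sketch under the assumption that the NUL to be replaced is always followed by a line number and a colon, as in the rgrep output above; the function name nul_to_colon is made up:

```python
import re


def nul_to_colon(text: str) -> str:
    """Replace each NUL that precedes '<digits>:' with a visible colon."""
    # The lookahead leaves any other NUL untouched, so unrelated binary
    # data is not silently rewritten.
    return re.sub(r"\x00(?=\d+:)", ":", text)


print(nul_to_colon("./some/file.txt\x00123:some text line"))
# prints ./some/file.txt:123:some text line
```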


Regards,
   rdiez



