From: "R. Diez" <rdiezmail-emacs@yahoo.de>
To: Eli Zaretskii <eliz@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca>
Cc: help-gnu-emacs@gnu.org
Subject: Re: Text copied from *grep* buffer has NUL (0x00) characters
Date: Sun, 9 May 2021 23:13:36 +0200
Message-ID: <634e880a-b43b-8c59-eb4c-b0c07813bb12@yahoo.de>
In-Reply-To: <83a6p37n15.fsf@gnu.org>


EZ> That's not the same.  The warning you saw is triggered by a failure to
EZ> convert to the external encoding, so it consumes no extra CPU cycles.

But it could be, from my (admittedly naive) point of view:

(convert-to-external-encoding  but-with-some-extra-flag-to-warn-about-NUL-chars)


EZ> Null bytes will not fail anything, so you should test for them
EZ> explicitly (and in some encodings, like UTF-16, they are necessary and
EZ> cannot be avoided).

I didn't know that about UTF-16, and I could not find any information about it either. Why would a NUL char be necessary in UTF-16 but not in UTF-8?

Or do you mean that UTF-16 text tends to contain many interleaved zero bytes? In that case, I would have thought that the problem is the 16-bit NUL character, that is, the code unit 0x0000, and not the individual zero bytes. That is the character to watch out for in UTF-16.
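
To make the distinction concrete, here is a minimal sketch in Python (not Emacs Lisp, and not anything Emacs does internally). The helper names are made up for illustration, and the sample assumes UTF-16 little-endian without a BOM. It counts zero bytes in the encoded data and, separately, 16-bit code units equal to 0x0000:

```python
def zero_byte_count(text: str, encoding: str) -> int:
    """Count 0x00 bytes in the encoded form of TEXT."""
    return text.encode(encoding).count(0)


def nul_char_count_utf16le(data: bytes) -> int:
    """Count 16-bit code units equal to 0x0000 in UTF-16LE data (no BOM)."""
    units = (data[i:i + 2] for i in range(0, len(data) - 1, 2))
    return sum(1 for unit in units if unit == b"\x00\x00")


plain = "grep"           # no NUL character at all
with_nul = "gr\x00ep"    # one real NUL character (U+0000)

print(zero_byte_count(plain, "utf-8"))                         # 0
print(zero_byte_count(plain, "utf-16-le"))                     # 4
print(nul_char_count_utf16le(plain.encode("utf-16-le")))       # 0
print(nul_char_count_utf16le(with_nul.encode("utf-16-le")))    # 1
```

Plain ASCII text encoded as UTF-16 is full of zero bytes without containing a single NUL character, so a check for UTF-16 would have to look at code units rather than bytes.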

Encodings like UTF-16, which always need more than one byte per character, are uncommon and do not work with many text editors or tools like 'grep'. Most people will expect problems with them anyway, so I wouldn't worry too much about them.

The NUL char issue (the unexpected problems I talked about), which you are likely to run into sooner or later, will probably only affect the popular, byte-oriented formats like ASCII, ISO/IEC 8859-1 and UTF-8.
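
For these byte-oriented encodings, a zero byte in the encoded data appears only where the text contains the NUL character itself, so a byte-level scan is enough. A rough Python sketch of such a scan, with a helper name that is purely illustrative:

```python
from typing import List


def nul_offsets(data: bytes) -> List[int]:
    """Return the byte offsets of every 0x00 byte in DATA."""
    return [i for i, byte in enumerate(data) if byte == 0]


text = "caf\u00e9\x00bar"   # "café", then a NUL character, then "bar"

print(nul_offsets(text.encode("latin-1")))        # [4]
print(nul_offsets(text.encode("utf-8")))          # [5]  ("é" takes two bytes)
print(nul_offsets("grep\x00".encode("ascii")))    # [4]
print(nul_offsets("caf\u00e9".encode("utf-8")))   # []   (no NUL, no zero byte)
```

The offsets differ between encodings, but in each case there is exactly one zero byte per NUL character and none otherwise.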


SM> I do think there's a real plain bug here, tho, if you change your
SM> "recipe" to `utf-8` instead of `utf-8-with-signature`: take a utf-8 text
SM> file (in a UTF-8 locale), add a NUL byte to it, save, close, and
SM> re-open: you now get a unibyte buffer showing the bytes rather than
SM> the chars.
SM>
SM> Emacs should generally try and warn you when saving a file with a coding
SM> system different than the one it would guess when later re-opening the file.
SM> The problem doesn't show up with `utf-8-with-signature` because
SM> apparently the BOM is given more weight than the NUL byte in determining
SM> which coding system to use.

Thanks for pointing that out.

This is my point: NUL may be a valid character, and perfectly fine in theory, but in practice it easily trips up even Emacs itself. That is why I would make Emacs smarter and have it warn about NUL characters, either on paste or on save.
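
The kind of check I have in mind can be sketched as a toy model in Python. This is NOT Emacs's actual detection algorithm. The guessing order (BOM first, then NUL byte, then a UTF-8 validity test) is only an assumption based on the behaviour described in this thread, and the function names and the "raw-bytes" label are hypothetical:

```python
import codecs
from typing import Optional


def guess_coding(data: bytes) -> str:
    """Toy guesser: a BOM wins, then a NUL byte means raw bytes, then UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-with-signature"
    if b"\x00" in data:
        return "raw-bytes"      # hypothetical label for a unibyte buffer
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "raw-bytes"


def encode_for_save(text: str, coding: str) -> bytes:
    """Encode TEXT as UTF-8, prepending a BOM only for the signature variant."""
    body = text.encode("utf-8")
    if coding == "utf-8-with-signature":
        return codecs.BOM_UTF8 + body
    return body


def save_warning(text: str, coding: str) -> Optional[str]:
    """Return a warning if the saved bytes would be re-read differently."""
    guessed = guess_coding(encode_for_save(text, coding))
    if guessed != coding:
        return f"saved as {coding}, but would be re-read as {guessed}"
    return None


print(save_warning("a\x00b", "utf-8"))                  # a warning string
print(save_warning("a\x00b", "utf-8-with-signature"))   # None
print(save_warning("abc", "utf-8"))                     # None
```

In this model the warning fires only for the plain `utf-8` case with a NUL character, which matches the recipe quoted above.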

There may be one more quirk in this area, because my text file had somehow lost the UTF-8 BOM too, and I only edit it with Emacs.

I cannot invest more time in this issue at the moment. I hope these posts provide enough information for anybody who wants to look into it in the future.

Regards,
   rdiez



