* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  [not found] ` <20190525191040.CCD6C207F5@vcs0.savannah.gnu.org>
@ 2019-05-25 19:41 ` Stefan Monnier
  2019-05-25 19:59 ` Eli Zaretskii
  2019-05-27 9:47 ` Robert Pluim
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-25 19:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> +length of the decoded text. If that buffer is a unibyte buffer
> +(@pxref{Selecting a Representations}), the internal representation of
> +the decoded text (@pxref{Text Representations}) is inserted into the
> +buffer as individual bytes.

If the decoded char is a byte between 128-255, is it inserted as
a single byte or as the two-byte sequence used internally for those
"eight-bit" chars?

        Stefan

^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 19:41 ` [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer Stefan Monnier
@ 2019-05-25 19:59 ` Eli Zaretskii
  2019-05-25 20:15 ` Eli Zaretskii
  2019-05-25 21:11 ` Stefan Monnier
  0 siblings, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-25 19:59 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Sat, 25 May 2019 15:41:46 -0400
>
> > +length of the decoded text. If that buffer is a unibyte buffer
> > +(@pxref{Selecting a Representations}), the internal representation of
> > +the decoded text (@pxref{Text Representations}) is inserted into the
> > +buffer as individual bytes.
>
> If the decoded char is a byte between 128-255, is it inserted as
> a single byte or as the two-byte sequence used internally for those
> "eight-bit" chars?

The internal representation of the decoded text could include both.
If some of the bytes in the original byte stream couldn't be decoded
using the specified coding-system, they will be represented as raw
bytes, using 2-byte sequences. OTOH, Latin characters successfully
decoded into codepoints less than 256 will take 1 byte.

Again, this is just the internal representation of what was decoded.
* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 19:59 ` Eli Zaretskii
@ 2019-05-25 20:15 ` Eli Zaretskii
  2019-05-25 21:11 ` Stefan Monnier
  1 sibling, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-25 20:15 UTC
  To: monnier; +Cc: emacs-devel

> Date: Sat, 25 May 2019 22:59:02 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
>
> The internal representation of the decoded text could include both.
> If some of the bytes in the original byte stream couldn't be decoded
> using the specified coding-system, they will be represented as raw
> bytes, using 2-byte sequences. OTOH, Latin characters successfully
> decoded into codepoints less than 256 will take 1 byte.
                          ^^^^^^^^^^^^^
Oops, I meant less than 128. Characters between 128 and 255 will be
represented as 2-byte UTF-8 sequences, of course.

> Again, this is just the internal representation of what was decoded.

Right.
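A minimal sketch of the byte counts discussed above, assuming a
reasonably recent Emacs. It only uses `string-bytes', `string' and
`string-to-multibyte'; the characters chosen (U+00E9 and the raw byte
#x80) are illustrative and do not come from the thread.

```elisp
;; Byte length of the internal (multibyte) representation.

;; An ASCII character occupies a single byte.
(string-bytes "a")

;; U+00E9 lies between 128 and 255, so it is stored as a 2-byte
;; UTF-8 sequence.  `string' returns a multibyte string here.
(string-bytes (string #xe9))

;; A raw byte (here #x80) that could not be decoded is also stored
;; as a 2-byte sequence once it lives in a multibyte string.
(string-bytes (string-to-multibyte "\200"))
```

Evaluating the three forms in turn should give 1, 2 and 2.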
* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 19:59 ` Eli Zaretskii
  2019-05-25 20:15 ` Eli Zaretskii
@ 2019-05-25 21:11 ` Stefan Monnier
  2019-05-25 21:27 ` Stefan Monnier
  2019-05-26 2:37 ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-25 21:11 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

> The internal representation of the decoded text could include both.
> If some of the bytes in the original byte stream couldn't be decoded
> using the specified coding-system, they will be represented as raw
> bytes, using 2-byte sequences. OTOH, Latin characters successfully
> decoded into codepoints less than 256 will take 1 byte.
> Again, this is just the internal representation of what was decoded.

Great, thanks.

But now I wonder, what can we do with this representation.
I guess set-buffer-multibyte will convert it to the intended chars, but
that bugs the question "why bother deciding into the unibyte buffer and
call set-buffer-multibyte afterwards rather than do the reverse"?
Anything else we can do with it?

        Stefan
* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 21:11 ` Stefan Monnier
@ 2019-05-25 21:27 ` Stefan Monnier
  2019-05-26 2:37 ` Eli Zaretskii
  1 sibling, 0 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-25 21:27 UTC
  To: emacs-devel

> But now I wonder, what can we do with this representation.
> I guess set-buffer-multibyte will convert it to the intended chars, but
> that bugs the question "why bother deciding into the unibyte buffer and
       ^^^^                          ^^^^^^^^
       begs                          decoding

The first typo is probably telling of my state of mind, tho :-)

        Stefan
* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 21:11 ` Stefan Monnier
  2019-05-25 21:27 ` Stefan Monnier
@ 2019-05-26 2:37 ` Eli Zaretskii
  1 sibling, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-26 2:37 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Sat, 25 May 2019 17:11:00 -0400
>
> But now I wonder, what can we do with this representation.
> I guess set-buffer-multibyte will convert it to the intended chars, but
> that bugs the question "why bother deciding into the unibyte buffer and
> call set-buffer-multibyte afterwards rather than do the reverse"?
> Anything else we can do with it?

I don't know. It isn't a usual thing to do, to be sure. But it isn't
non-sensical, either.
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  [not found] ` <20190525191040.CCD6C207F5@vcs0.savannah.gnu.org>
  2019-05-25 19:41 ` [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer Stefan Monnier
@ 2019-05-27 9:47 ` Robert Pluim
  2019-05-27 12:24 ` Stefan Monnier
  2019-05-27 16:40 ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Robert Pluim @ 2019-05-27 9:47 UTC
  To: emacs-devel; +Cc: Eli Zaretskii

>>>>> On Sat, 25 May 2019 15:10:40 -0400 (EDT), eliz@gnu.org (Eli Zaretskii) said:

    Eli> branch: emacs-26
    Eli> commit 8f18d121210aa27dc05555140ab21a8489f0de50
    Eli> Author: Eli Zaretskii <eliz@gnu.org>
    Eli> Commit: Eli Zaretskii <eliz@gnu.org>

    Eli> Improve documentation of decoding into a unibyte buffer

    Eli> * doc/lispref/nonascii.texi (Explicit Encoding): Document what
    Eli> happens when DESTINATION of decoding is a unibyte buffer.
    Eli> * src/coding.c (Fdecode_coding_region)
    Eli> (Fdecode_coding_string): Document what happens if DESTINATION
    Eli> is a unibyte buffer.

A related issue: C-h f string-as-unibyte

    string-as-unibyte is a built-in function in `src/fns.c'.

    (string-as-unibyte STRING)

    This function is obsolete since 26.1;
    use `encode-coding-string'.
    Probably introduced at or before Emacs version 20.3.
    This function does not change global state, including the match data.

Having trawled through the elisp manual, for the life of me itʼs not
clear which coding system I should use. 'raw-text'? 'us-ascii'?
Something Else?

Robert
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 9:47 ` Robert Pluim
@ 2019-05-27 12:24 ` Stefan Monnier
  2019-05-27 13:02 ` Robert Pluim
  2019-05-27 16:42 ` Eli Zaretskii
  2019-05-27 16:40 ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-27 12:24 UTC
  To: emacs-devel

> A related issue: C-h f string-as-unibyte
>
> string-as-unibyte is a built-in function in `src/fns.c'.
>
> (string-as-unibyte STRING)
>
> This function is obsolete since 26.1;
> use `encode-coding-string'.
> Probably introduced at or before Emacs version 20.3.
> This function does not change global state, including the match data.
>
> Having trawled through the elisp manual, for the life of me itʼs not
> clear which coding system I should use. 'raw-text'? 'us-ascii'?
> Something Else?

The coding that most closely corresponds to what string-as-unibyte does
is `emacs-internal`. In 90% of the cases, it's not what you want, tho
because the code shouldn't have used string-as-unibyte in the
first place, so you'll need to find out what the code *really* needs.

        Stefan
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 12:24 ` Stefan Monnier
@ 2019-05-27 13:02 ` Robert Pluim
  2019-05-27 13:32 ` Stefan Monnier
  2019-05-27 16:43 ` Eli Zaretskii
  2019-05-27 16:42 ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Robert Pluim @ 2019-05-27 13:02 UTC
  To: Stefan Monnier; +Cc: emacs-devel

>>>>> On Mon, 27 May 2019 08:24:46 -0400, Stefan Monnier <monnier@iro.umontreal.ca> said:

    >> A related issue: C-h f string-as-unibyte
    >>
    >> string-as-unibyte is a built-in function in `src/fns.c'.
    >>
    >> (string-as-unibyte STRING)
    >>
    >> This function is obsolete since 26.1;
    >> use `encode-coding-string'.
    >> Probably introduced at or before Emacs version 20.3.
    >> This function does not change global state, including the match data.
    >>
    >> Having trawled through the elisp manual, for the life of me itʼs not
    >> clear which coding system I should use. 'raw-text'? 'us-ascii'?
    >> Something Else?

    Stefan> The coding that most closely corresponds to what string-as-unibyte does
    Stefan> is `emacs-internal`. In 90% of the cases, it's not what you want, tho
    Stefan> because the code shouldn't have used string-as-unibyte in the
    Stefan> first place, so you'll need to find out what the code *really* needs.

Almost all uses of string-as-unibyte are gone now, but the one I was
looking at is this one in international/mule-cmds.el:

    (defun encoded-string-description (str coding-system)
      "Return a pretty description of STR that is encoded by CODING-SYSTEM."
      (setq str (string-as-unibyte str))
      (mapconcat
       (if (and coding-system (eq (coding-system-type coding-system) 'iso-2022))
           ;; Try to get a pretty description for ISO 2022 escape sequences.
           (function (lambda (x) (or (cdr (assq x iso-2022-control-alist))
                                     (format "#x%02X" x))))
         (function (lambda (x) (format "#x%02X" x))))
       str " "))

If I take a string of say "β", and replace string-as-unibyte with
(encode-coding-string 'emacs-internal), `encoded-string-description'
prints "#xCE #xB2", which is the correct UTF-8 encoded
value. 'raw-text works too. Iʼm certain that there are subtle
differences between the two that I donʼt understand.

Robert
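The "#xCE #xB2" output mentioned above can be reproduced with a small
sketch. The helper name `my-hex-bytes' is hypothetical and is not the
mule-cmds.el definition; it assumes its argument is already a unibyte
string, and it uses the plain `utf-8' coding system rather than
`emacs-internal' or `raw-text'.

```elisp
;; Hypothetical helper (not part of Emacs): format each byte of a
;; unibyte string the way `encoded-string-description' does in its
;; non-ISO-2022 branch.
(defun my-hex-bytes (unibyte-str)
  "Return the bytes of UNIBYTE-STR as space-separated hex numbers."
  (mapconcat (lambda (byte) (format "#x%02X" byte))
             unibyte-str " "))

;; U+03B2 (GREEK SMALL LETTER BETA) explicitly encoded as UTF-8.
(my-hex-bytes (encode-coding-string (string #x3b2) 'utf-8))
;; => "#xCE #xB2"
```

Encoding explicitly before formatting keeps the question of which
coding system is meant out of the formatting helper.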
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:02 ` Robert Pluim
@ 2019-05-27 13:32 ` Stefan Monnier
  2019-05-27 13:49 ` Robert Pluim
  2019-05-27 16:51 ` Eli Zaretskii
  2019-05-27 16:43 ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-27 13:32 UTC
  To: emacs-devel

> Almost all uses of string-as-unibyte are gone now, but the one I was
> looking at is this one in international/mule-cmds.el:
>
> (defun encoded-string-description (str coding-system)
>   "Return a pretty description of STR that is encoded by CODING-SYSTEM."
>   (setq str (string-as-unibyte str))
>   (mapconcat
>    (if (and coding-system (eq (coding-system-type coding-system) 'iso-2022))
>        ;; Try to get a pretty description for ISO 2022 escape sequences.
>        (function (lambda (x) (or (cdr (assq x iso-2022-control-alist))
>                                  (format "#x%02X" x))))
>      (function (lambda (x) (format "#x%02X" x))))
>    str " "))
>
> If I take a string of say "β", and replace string-as-unibyte with
> (encode-coding-string 'emacs-internal), `encoded-string-description'
> prints "#xCE #xB2", which is the correct UTF-8 encoded
> value. 'raw-text works too. Iʼm certain that there are subtle
> differences between the two that I donʼt understand.

But "β" is not a "STR that is encoded by CODING-SYSTEM", so this output
is neither correct nor incorrect in any case.

I think the right thing to do here is one of:
- signal an error if `str` is multibyte.
- signal an error if `str` is multibyte and contains non-byte chars.
- if multibyte, encode `str` with `coding-system`.
- just don't bother looking at whether `str` is unibyte or not, just
  pass it as is to `mapconcat`.
- just don't bother looking at whether `str` is unibyte or not, just
  pass it as is to `mapconcat` but in the lambda, do catch the case
  where `x` is an "eight bit raw-byte char" and if so pass it to
  multibyte-char-to-unibyte.
- ...

But encoding `str` with any coding system like raw-text or
emacs-internal doesn't seem to make much sense.

        Stefan
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:32 ` Stefan Monnier
@ 2019-05-27 13:49 ` Robert Pluim
  2019-05-27 16:53 ` Eli Zaretskii
  2019-05-28 3:08 ` Stefan Monnier
  2019-05-27 16:51 ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Robert Pluim @ 2019-05-27 13:49 UTC
  To: Stefan Monnier; +Cc: emacs-devel

>>>>> On Mon, 27 May 2019 09:32:11 -0400, Stefan Monnier <monnier@iro.umontreal.ca> said:

    >> If I take a string of say "β", and replace string-as-unibyte with
    >> (encode-coding-string 'emacs-internal), `encoded-string-description'
    >> prints "#xCE #xB2", which is the correct UTF-8 encoded
    >> value. 'raw-text works too. Iʼm certain that there are subtle
    >> differences between the two that I donʼt understand.

    Stefan> But "β" is not a "STR that is encoded by CODING-SYSTEM", so this output
    Stefan> is neither correct nor incorrect in any case.

It matches the current output of encoded-string-description, though.

    Stefan> I think the right thing to do here is one of:
    Stefan> - signal an error if `str` is multibyte.
    Stefan> - signal an error if `str` is multibyte and contains non-byte chars.
    Stefan> - if multibyte, encode `str` with `coding-system`.
    Stefan> - just don't bother looking at whether `str` is unibyte or not, just
    Stefan>   pass it as is to `mapconcat`.
    Stefan> - just don't bother looking at whether `str` is unibyte or not, just
    Stefan>   pass it as is to `mapconcat` but in the lambda, do catch the case
    Stefan>   where `x` is an "eight bit raw-byte char" and if so pass it to
    Stefan>   multibyte-char-to-unibyte.
    Stefan> - ...

Since this is the underlying code that displays the 'buffer code'
section of 'C-u C-x =', I donʼt think barfing on multibyte is the
right thing to do. Nor is passing it on as is.

    Stefan> But encoding `str` with any coding system like raw-text or
    Stefan> emacs-internal doesn't seem to make much sense.

Then what is the correct way to say 'give me the raw byte version
of this character'?

(or maybe we should just let sleeping encodings lie :-) )

Robert
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:49 ` Robert Pluim
@ 2019-05-27 16:53 ` Eli Zaretskii
  2019-05-28 6:23 ` Robert Pluim
  2019-05-28 3:08 ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-27 16:53 UTC
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Date: Mon, 27 May 2019 15:49:50 +0200
> Cc: emacs-devel@gnu.org
>
> Then what is the correct way to say 'give me the raw byte version
> of this character'?

I suspect that I already answered that question, but if not, please
explain what you mean by "the raw byte version of a character".
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 16:53 ` Eli Zaretskii
@ 2019-05-28 6:23 ` Robert Pluim
  2019-05-28 14:57 ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Robert Pluim @ 2019-05-28 6:23 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

>>>>> On Mon, 27 May 2019 19:53:18 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Robert Pluim <rpluim@gmail.com>
    >> Date: Mon, 27 May 2019 15:49:50 +0200
    >> Cc: emacs-devel@gnu.org
    >>
    >> Then what is the correct way to say 'give me the raw byte version
    >> of this character'?

    Eli> I suspect that I already answered that question, but if not, please
    Eli> explain what you mean by "the raw byte version of a character".

Yes, you did. 'the bytes corresponding to the internal representation'
is what I was looking for.

(this whole area is pretty easy to get confused about, especially with
the various *-unibyte functions. Perhaps I should just pretend they
donʼt exist).

Robert
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 6:23 ` Robert Pluim
@ 2019-05-28 14:57 ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 14:57 UTC
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 08:23:37 +0200
>
> (this whole area is pretty easy to get confused about, especially with
> the various *-unibyte functions. Perhaps I should just pretend they
> donʼt exist).

Welcome to the club. I myself _never_ remember what each one of them
does, so every time they pop up in discussions or in code, I need to
look up their code to figure that out anew.
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:49 ` Robert Pluim
  2019-05-27 16:53 ` Eli Zaretskii
@ 2019-05-28 3:08 ` Stefan Monnier
  2019-05-28 4:40 ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 3:08 UTC
  To: emacs-devel

> Since this is the underlying code that displays the 'buffer code'
> section of 'C-u C-x =', I donʼt think barfing on multibyte is the
> right thing to do. Nor is passing it on as is.

grep gives me:

    % grep --color -nH --null -e encoded-string-description **/*.el
    descr-text.el\0350: (encoded-string-description encoded coding)))))
    descr-text.el\0645: (encoded-string-description
    descr-text.el\0655: (list (encoded-string-description encoded coding)
    international/mule-cmds.el\02900:(defun encoded-string-description (str coding-system)
    simple.el\01481: (encoded-string-description encoded coding)))

and according to my quick investigation, all callers pass a string
that's coming straight from encode-coding-string or encode-coding-char,
so the argument should *always* be unibyte, AFAICT.

Hence according to my reading of the code, this call to
string-as-unibyte will always just return its argument unchanged.

        Stefan
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 3:08 ` Stefan Monnier
@ 2019-05-28 4:40 ` Eli Zaretskii
  2019-05-28 11:55 ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 4:40 UTC
  To: emacs-devel, Stefan Monnier

On May 28, 2019 6:08:36 AM GMT+03:00, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

> and according to my quick investigation, all callers pass a string
> that's coming straight from encode-coding-string or
> encode-coding-char,
> so the argument should *always* be unibyte, AFAICT.
>
> Hence according to my reading of the code, this call to
> string-as-unibyte will always just return its argument unchanged.

That's not entirely true, because encode-coding-char can return a
multibyte string. Depending on what you wanted to suggest based on
your conclusion, that factoid may or may not be important.
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 4:40 ` Eli Zaretskii
@ 2019-05-28 11:55 ` Stefan Monnier
  2019-05-28 15:18 ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 11:55 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

>> and according to my quick investigation, all callers pass a string
>> that's coming straight from encode-coding-string or
>> encode-coding-char,
>> so the argument should *always* be unibyte, AFAICT.
>>
>> Hence according to my reading of the code, this call to
>> string-as-unibyte will always just return its argument unchanged.
> That's not entirely true, because encode-coding-char can return a multibyte
> string.

That's weird. When would that happen?

        Stefan
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 11:55 ` Stefan Monnier
@ 2019-05-28 15:18 ` Eli Zaretskii
  2019-05-28 17:43 ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 15:18 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 07:55:40 -0400
>
> >> Hence according to my reading of the code, this call to
> >> string-as-unibyte will always just return its argument unchanged.
> > That's not entirely true, because encode-coding-char can return a multibyte
> > string.
>
> That's weird. When would that happen?

"Use the source, Luke!"

  (let* ((str1 (string-as-multibyte (string char)))
         (str2 (string-as-multibyte (string char char)))
         (found (find-coding-systems-string str1))
         enc1 enc2 i1 i2)
    (if (and (consp found)
             (eq (car found) 'undecided))
        str1   <<<<<<<<<<<<<<<<<<<<<<<<<

If we return here, the value is str1, which is a multibyte string, see
how it was calculated.

The easiest use case is this:

  (multibyte-string-p (encode-coding-char ?a 'utf-8)) => t

I didn't think enough about this to figure out if there can be less
trivial use cases. If you can describe all the cases where
find-coding-systems-string will return a list whose 'car' is
'undecided', my hat off to you.
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 15:18 ` Eli Zaretskii
@ 2019-05-28 17:43 ` Stefan Monnier
  2019-05-28 18:58 ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 17:43 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

> "Use the source, Luke!"

But the dark side is so enticing!

> (let* ((str1 (string-as-multibyte (string char)))
>        (str2 (string-as-multibyte (string char char)))

Why on earth do we call string-as-multibyte here? AFAIK, the only cases
where `string` returns a unibyte string is when char <128 (it could make
sense to also do that for char ≥128 and <160, but we don't seem to do
that currently) and these are better turned into multibyte via
string-TO-unibyte (tho here we don't even need that, since the unibyte
string works just as well for what we do) than string-AS-unibyte.

I think this is an error. The patch below seems in order.

>        (found (find-coding-systems-string str1))
>        enc1 enc2 i1 i2)
>   (if (and (consp found)
>            (eq (car found) 'undecided))
>       str1   <<<<<<<<<<<<<<<<<<<<<<<<<
>
> If we return here, the value is str1, which is a multibyte string, see
> how it was calculated.

I think it's a bug. Largely harmless since it only applies to ASCII
chars for which we conflate the char/byte status, but still, it's
a wart.

> I didn't think enough about this to figure out if there can be less
> trivial use cases. If you can describe all the cases where
> find-coding-systems-string will return a list whose 'car' is
> 'undecided', my hat off to you.

AFAIK it only happens for pure-ASCII strings.

        Stefan

diff --git a/lisp/international/mule-cmds.el b/lisp/international/mule-cmds.el
index 2b0aaca664..391efbedc8 100644
--- a/lisp/international/mule-cmds.el
+++ b/lisp/international/mule-cmds.el
@@ -2926,12 +2926,11 @@ encode-coding-char
 If CODING-SYSTEM can't safely encode CHAR, return nil.
 The 3rd optional argument CHARSET, if non-nil, is a charset preferred
 on encoding."
-  (let* ((str1 (string-as-multibyte (string char)))
-         (str2 (string-as-multibyte (string char char)))
+  (let* ((str1 (string char))
+         (str2 (string char char))
          (found (find-coding-systems-string str1))
          enc1 enc2 i1 i2)
-    (if (and (consp found)
-             (eq (car found) 'undecided))
+    (if (not (multibyte-string-p str1))
         str1
       (when (memq (coding-system-base coding-system) found)
         ;; We must find the encoded string of CHAR.  But, just encoding
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 17:43 ` Stefan Monnier
@ 2019-05-28 18:58 ` Eli Zaretskii
  2019-05-28 19:35 ` Eli Zaretskii
  2019-05-28 23:44 ` Stefan Monnier
  0 siblings, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 18:58 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 13:43:47 -0400
>
> > (let* ((str1 (string-as-multibyte (string char)))
> >        (str2 (string-as-multibyte (string char char)))
>
> Why on earth do we call string-as-multibyte here? AFAIK, the only cases
> where `string` returns a unibyte string is when char <128 (it could make
> sense to also do that for char ≥128 and <160, but we don't seem to do
> that currently) and these are better turned into multibyte via
> string-TO-unibyte (tho here we don't even need that, since the unibyte
> string works just as well for what we do) than string-AS-unibyte.
>
> I think this is an error. The patch below seems in order.

I'm not sure. Be sure to read the comments about the tricky business
of this function, and the method it employs to solve it, and be sure
you understand all of the subtleties there.

> > I didn't think enough about this to figure out if there can be less
> > trivial use cases. If you can describe all the cases where
> > find-coding-systems-string will return a list whose 'car' is
> > 'undecided', my hat off to you.
>
> AFAIK it only happens for pure-ASCII strings.

What is your reasoning?
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 18:58 ` Eli Zaretskii
@ 2019-05-28 19:35 ` Eli Zaretskii
  2019-05-28 23:44 ` Stefan Monnier
  1 sibling, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 19:35 UTC
  To: monnier; +Cc: emacs-devel

> Date: Tue, 28 May 2019 21:58:03 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
>
> > From: Stefan Monnier <monnier@iro.umontreal.ca>
> > Cc: emacs-devel@gnu.org
> > Date: Tue, 28 May 2019 13:43:47 -0400
> >
> > > (let* ((str1 (string-as-multibyte (string char)))
> > >        (str2 (string-as-multibyte (string char char)))
> >
> > Why on earth do we call string-as-multibyte here? AFAIK, the only cases
> > where `string` returns a unibyte string is when char <128 (it could make
> > sense to also do that for char ≥128 and <160, but we don't seem to do
> > that currently) and these are better turned into multibyte via
> > string-TO-unibyte (tho here we don't even need that, since the unibyte
> > string works just as well for what we do) than string-AS-unibyte.
> >
> > I think this is an error. The patch below seems in order.
>
> I'm not sure. Be sure to read the comments about the tricky business
> of this function, and the method it employs to solve it, and be sure
> you understand all of the subtleties there.

Btw, that function has a bug:

  (encode-coding-char ?a 'ebcdic-us) => "a"

It assumes that any ASCII character is encoded into itself, i.e. that
every coding-system is ASCII-compatible, which is false.
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 18:58 ` Eli Zaretskii
  2019-05-28 19:35 ` Eli Zaretskii
@ 2019-05-28 23:44 ` Stefan Monnier
  2019-05-29 14:33 ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 23:44 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

>> I think this is an error. The patch below seems in order.
>
> I'm not sure. Be sure to read the comments about the tricky business
> of this function, and the method it employs to solve it, and be sure
> you understand all of the subtleties there.

This only applies to the case where `char` is not ASCII.

I installed a slightly more conservative patch which should make sure
the returned string is always unibyte and that also fixes the ebcdic
case at the same occasion.

>> AFAIK it only happens for pure-ASCII strings.
> What is your reasoning?

For one, the docstring says that, pretty much. But also the fact that
`undecided` implies that any coding system should be applicable, IOW
`char` is in the intersection of all the coding systems we have, so it
can only happen if the string is pure ASCII (since one of the coding
systems is `us-ascii`, the intersection cannot be larger than that.
That doesn't preclude a non-`undecided` return value for some pure ASCII
strings, admittedly, tho the docstring suggests that any ASCII string
just returns `undecided`).

        Stefan
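The docstring behaviour referred to above can be checked with a short
sketch. It assumes only that `find-coding-systems-string' is available
(it lives in mule-cmds.el); the sample string "abc" is illustrative.

```elisp
;; For a string without multibyte characters, the docstring of
;; `find-coding-systems-string' promises a one-element list.
(find-coding-systems-string "abc")
;; => (undecided)

;; The test used in the proposed patch: a string built by `string'
;; from a single ASCII character is unibyte.
(multibyte-string-p (string ?a))
;; => nil
```

Only the pure-ASCII case is shown; what the function returns for
non-ASCII input depends on the set of coding systems defined in the
running Emacs.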
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 23:44 ` Stefan Monnier
@ 2019-05-29 14:33 ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-29 14:33 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 19:44:15 -0400
>
> I installed a slightly more conservative patch

Ugh! why unilaterally? why not show the patch and wait for agreement?
what's the rush?
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:32 ` Stefan Monnier
  2019-05-27 13:49 ` Robert Pluim
@ 2019-05-27 16:51 ` Eli Zaretskii
  2019-05-27 19:17 ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-27 16:51 UTC
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 27 May 2019 09:32:11 -0400
>
> encoding `str` with any coding system like raw-text or
> emacs-internal doesn't seem to make much sense.

For a multibyte string that was encoded already, encoding by
utf-8-emacs-unix is IMO the _only_ thing that makes sense. That's how
you convert raw bytes in their internal representation to the
single-byte external representation you want to see in the output.

Also note that encoded-string-description supports codepoints outside
the Unicode space, and utf-8-emacs-unix is the only sane way to
produce the overlong UTF-8 sequences in those cases.
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 16:51 ` Eli Zaretskii
@ 2019-05-27 19:17 ` Stefan Monnier
  2019-05-28 2:30 ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-27 19:17 UTC
  To: emacs-devel

> For a multibyte string that was encoded already, encoding by
> utf-8-emacs-unix is IMO the _only_ thing that makes sense.

I disagree: a multibyte string that was encoded already should only
contain ASCII and "eight-bit raw byte chars" (and BTW, would be much
better represented as a unibyte string to start with) so it can be
converted with string-to-unibyte.

        Stefan
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-27 19:17 ` Stefan Monnier @ 2019-05-28 2:30 ` Eli Zaretskii 2019-05-28 2:56 ` Stefan Monnier 0 siblings, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-28 2:30 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Mon, 27 May 2019 15:17:12 -0400 > > > For a multibyte string that was encoded already, encoding by > > utf-8-emacs-unix is IMO the _only_ thing that makes sense. > > I disagree: a multibyte string that was encoded already should only > contain ASCII and "eight-bit raw byte chars" But both ASCII and raw bytes have multibyte representation. If the destination of the encoding is a multibyte buffer, that is what you get there, and then taking buffer-substring will give you a multibyte string with encoded text. I don't see why we shouldn't support this scenario. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 2:30 ` Eli Zaretskii @ 2019-05-28 2:56 ` Stefan Monnier 2019-05-28 4:17 ` Eli Zaretskii 0 siblings, 1 reply; 50+ messages in thread From: Stefan Monnier @ 2019-05-28 2:56 UTC (permalink / raw) To: emacs-devel >> > For a multibyte string that was encoded already, encoding by >> > utf-8-emacs-unix is IMO the _only_ thing that makes sense. >> I disagree: a multibyte string that was encoded already should only >> contain ASCII and "eight-bit raw byte chars" > But both ASCII and raw bytes have multibyte representation. Not sure why you say "but" here: I was also talking about a multibyte string, so I obviously agree. > If the destination of the encoding is a multibyte buffer, that is what > you get there, and then taking buffer-substring will give you > a multibyte string with encoded text. I don't see why we shouldn't > support this scenario. I'm not sure where you read that I was arguing we shouldn't support this scenario. I was just pointing out that utf-8-emacs-unix is not the only thing that makes sense: string-to-unibyte should work just as well (if not better). Stefan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 2:56 ` Stefan Monnier @ 2019-05-28 4:17 ` Eli Zaretskii 2019-05-28 6:21 ` Robert Pluim 2019-05-28 11:54 ` Stefan Monnier 0 siblings, 2 replies; 50+ messages in thread From: Eli Zaretskii @ 2019-05-28 4:17 UTC (permalink / raw) To: emacs-devel, Stefan Monnier On May 28, 2019 5:56:18 AM GMT+03:00, Stefan Monnier <monnier@iro.umontreal.ca> wrote: > >> > For a multibyte string that was encoded already, encoding by > >> > utf-8-emacs-unix is IMO the _only_ thing that makes sense. > >> I disagree: a multibyte string that was encoded already should only > >> contain ASCII and "eight-bit raw byte chars" > > But both ASCII and raw bytes have multibyte representation. > > Not sure why you say "but" here: I was also talking about a multibyte > string, so I obviously agree. > > > If the destination of the encoding is a multibyte buffer, that is what > > you get there, and then taking buffer-substring will give you > > a multibyte string with encoded text. I don't see why we shouldn't > > support this scenario. > > I'm not sure where you read that I was arguing we shouldn't support > this scenario. I was just pointing out that utf-8-emacs-unix is not the > only thing that makes sense: string-to-unibyte should work just as well > (if not better). > > > Stefan string-to-unibyte is also marked obsolete. Robert was asking how to use encoding functions instead of the obsolete string-as-unibyte; advising to use yet another obsolete function makes little sense to me. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 4:17 ` Eli Zaretskii @ 2019-05-28 6:21 ` Robert Pluim 2019-05-28 11:53 ` Stefan Monnier 2019-05-28 11:54 ` Stefan Monnier 1 sibling, 1 reply; 50+ messages in thread From: Robert Pluim @ 2019-05-28 6:21 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel >>>>> On Tue, 28 May 2019 07:17:35 +0300, Eli Zaretskii <eliz@gnu.org> said: >> I'm not sure where you read that I was arguing we shouldn't support >> this scenario. I was just pointing out that utf-8-emacs-unix is not >> the >> only thing that makes sense: string-to-unibyte should work just as >> well >> (if not better). >> >> >> Stefan Eli> string-to-unibyte is also marked obsolete. Robert was asking how to Eli> use encoding functions instead of the obsolete string-as-unibyte; Eli> advising to use yet another obsolete function makes little sense to Eli> me. Plus calling string-to-unibyte on a multi-byte string will signal an error. (encode-coding-string str 'emacs-internal) seems to be what I want, as Eli pointed out upthread. Robert ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 6:21 ` Robert Pluim @ 2019-05-28 11:53 ` Stefan Monnier 0 siblings, 0 replies; 50+ messages in thread From: Stefan Monnier @ 2019-05-28 11:53 UTC (permalink / raw) To: emacs-devel > Plus calling string-to-unibyte on a multi-byte string will signal an > error. Only if that string was not encoded. Which means it's good because it will tell you when you called encoded-string-description incorrectly rather than silently outputting garbage. Stefan ^ permalink raw reply [flat|nested] 50+ messages in thread
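The behavior described in this exchange can be sketched as follows. This is only an illustration: `demo-string-to-unibyte-safely` is a hypothetical wrapper, and the symbol `not-encoded` is an arbitrary marker chosen for the example.

```elisp
;; Hypothetical wrapper, for illustration only.
(defun demo-string-to-unibyte-safely (string)
  "Return STRING's bytes as a list, or the symbol `not-encoded' on error."
  (condition-case nil
      (append (string-to-unibyte string) nil)
    (error 'not-encoded)))

;; Encoded text held in a multibyte string: only raw-byte chars, so it converts.
(demo-string-to-unibyte-safely
 (string-to-multibyte (unibyte-string #xCE #xB2)))

;; Decoded text (the character β itself): string-to-unibyte signals an error,
;; which the wrapper turns into the marker symbol.
(demo-string-to-unibyte-safely "\u03b2")
```

The error in the second call is the diagnostic Stefan refers to: it flags a caller that passed decoded text where encoded text was expected.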
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 4:17 ` Eli Zaretskii 2019-05-28 6:21 ` Robert Pluim @ 2019-05-28 11:54 ` Stefan Monnier 2019-05-28 15:11 ` Eli Zaretskii 1 sibling, 1 reply; 50+ messages in thread From: Stefan Monnier @ 2019-05-28 11:54 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > string-to-unibyte is also marked obsolete. Not any more: ;; We used to declare string-to-unibyte obsolete, but it is a valid ;; (make-obsolete 'string-to-unibyte "use `encode-coding-string'." "26.1") Stefan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 11:54 ` Stefan Monnier @ 2019-05-28 15:11 ` Eli Zaretskii 2019-05-28 17:25 ` Stefan Monnier 0 siblings, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-28 15:11 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: emacs-devel@gnu.org > Date: Tue, 28 May 2019 07:54:41 -0400 > > > string-to-unibyte is also marked obsolete. > > Not any more: > > ;; We used to declare string-to-unibyte obsolete, but it is a valid > ;; (make-obsolete 'string-to-unibyte "use `encode-coding-string'." "26.1") OK, but how does this affect the issue at hand? We want to replace string-as-unibyte, or remove it, and the obsolescence message says to replace it with encode-coding-string. So if we cannot remove it for some reason, it makes more sense to use encode-coding-string. Either way, using string-to-unibyte instead sounds less desirable to me. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 15:11 ` Eli Zaretskii @ 2019-05-28 17:25 ` Stefan Monnier 2019-05-28 18:51 ` Eli Zaretskii 0 siblings, 1 reply; 50+ messages in thread From: Stefan Monnier @ 2019-05-28 17:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > OK, but how does this affect the issue at hand? We want to replace > string-as-unibyte, or remove it, and the obsolescence message says to > replace it with encode-coding-string. But this obsolescence message assumes that the call to string-as-unibyte was used because of a need to encode the strings using Emacs's internal coding-system. In this case, the string is already encoded (as stated by the function's name, the docstring, ...), so using encode-coding-string is rather odd. Also using string-to-unibyte will correctly signal an error if the caller forgot to send an *encoded* string. > Either way, using string-to-unibyte instead sounds less desirable > to me. I agree that removing the call altogether is the better option. Stefan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 17:25 ` Stefan Monnier @ 2019-05-28 18:51 ` Eli Zaretskii 2019-05-28 23:39 ` Stefan Monnier 0 siblings, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-28 18:51 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: emacs-devel@gnu.org > Date: Tue, 28 May 2019 13:25:17 -0400 > > > OK, but how does this affect the issue at hand? We want to replace > > string-as-unibyte, or remove it, and the obsolescence message says to > > replace it with encode-coding-string. > > But this obsolescence message assumes that the call to string-as-unibyte > was used because of a need to encode the strings using Emacs's internal > coding-system. In this case, the string is already encoded (as stated > by the function's name, the docstring, ...), so using > encode-coding-string is rather odd. If the input string is unibyte, then using string-to-unibyte will also be odd. And if it's multibyte, then using encode-coding-string is not really odd, is it? > I agree that removing the call altogether is the better option. Right. In that case we need to document that the function expects as input either a unibyte string or a pure-ASCII string. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 18:51 ` Eli Zaretskii @ 2019-05-28 23:39 ` Stefan Monnier 2019-05-29 2:45 ` Eli Zaretskii 0 siblings, 1 reply; 50+ messages in thread From: Stefan Monnier @ 2019-05-28 23:39 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > If the input string is unibyte, then using string-to-unibyte will > also be odd. And if it's multibyte, then using encode-coding-string is > not really odd, is it? It's odd that the input string be multibyte since it's supposed to be encoded, yes. And it's also odd to call encode-coding-string on a string that we assume to be encoded (just because the string is multibyte doesn't make it less odd). In any case, I think we should strive to avoid using "encoded" multibyte strings. I can't remember ever having had a need for those, but when needed the way to turn a unibyte string into an equivalent multibyte string (without changing the fact that it's encoded) is string-to-multibyte. >> I agree that removing the call altogether is the better option. > > Right. In that case we need to document that the function expects as > input either a unibyte string or a pure-ASCII string. OK, I'll do that. Stefan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 23:39 ` Stefan Monnier @ 2019-05-29 2:45 ` Eli Zaretskii 2019-05-29 16:28 ` Stefan Monnier 0 siblings, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-29 2:45 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: emacs-devel@gnu.org > Date: Tue, 28 May 2019 19:39:15 -0400 > > In any case, I think we should strive to avoid using "encoded" multibyte > strings. I don't think it's possible, because buffers are by default multibyte. (And I don't really see a need for avoiding that, but that's an old disagreement between us.) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-29 2:45 ` Eli Zaretskii @ 2019-05-29 16:28 ` Stefan Monnier 2019-05-29 18:19 ` Eli Zaretskii 0 siblings, 1 reply; 50+ messages in thread From: Stefan Monnier @ 2019-05-29 16:28 UTC (permalink / raw) To: emacs-devel >> In any case, I think we should strive to avoid using "encoded" multibyte >> strings. > I don't think it's possible, "strive to avoid" is always possible. I didn't say we should completely disallow it (which might be possible, but it's too far from where we are to be able to tell). > because buffers are by default multibyte. And those contain chars 99.99% of the time. And buffers that contain bytes are unibyte in most cases. This is the sane way to work. It makes it easy to know what is what. Also, not only is it possible, but it's pretty much the case already. Whether we'll be able to eliminate all cases, I don't know. But I think we should try to make the cases of "decoded text in unibyte" and "encoded text in multibyte" as rare as possible. [ Similarly, set-buffer-multibyte should only ever be called in an empty buffer. ] Stefan PS: I added checks in encoding/decoding functions to signal errors when decoding from multibyte and encoding from unibyte (in my local Emacs), and that's been tremendously useful to track down and fix encoding bugs in Gnus. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-29 16:28 ` Stefan Monnier @ 2019-05-29 18:19 ` Eli Zaretskii 2019-05-29 18:58 ` Stefan Monnier 0 siblings, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-29 18:19 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Wed, 29 May 2019 12:28:30 -0400 > > PS: I added checks in encoding/decoding functions to signal errors when > decoding from multibyte and encoding from unibyte (in my local Emacs), > and that's been tremendously useful to track down and fix encoding bugs > in Gnus. I have nothing against you making these changes locally to catch bugs. I sometimes do similar things in my local version of the code. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-29 18:19 ` Eli Zaretskii @ 2019-05-29 18:58 ` Stefan Monnier 2019-05-29 19:09 ` Eli Zaretskii 0 siblings, 1 reply; 50+ messages in thread From: Stefan Monnier @ 2019-05-29 18:58 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > > I don't think there was much to discuss about it. > Unilaterally. Yes, it was a clear and simple bug-fix, AFAIK. >> PS: I added checks in encoding/decoding functions to signal errors when >> decoding from multibyte and encoding from unibyte (in my local Emacs), >> and that's been tremendously useful to track down and fix encoding bugs >> in Gnus. > I have nothing against you making these changes locally to catch > bugs. I sometimes do similar things in my local version of the code. Yes, and this experience showed me that it's a good practice to follow (I don't mean "adding checks" but "align unibyte/multibyte with encoded/decoded"). BTW, I think we should add such checks in `master` (but make them conditional on a variable like `check-encoding/decoding-strictly-for-debugging-purposes` which would of course default to nil). Stefan ^ permalink raw reply [flat|nested] 50+ messages in thread
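A rough Lisp-level sketch of what such an opt-in check might look like. This is not Stefan's actual patch, which is not shown in the thread; the variable `demo-check-coding-strictly` and the function `demo-strict-encode-coding-string` are hypothetical names, and the rule "reject a unibyte string only when it has bytes above 127" is an assumption made so that plain ASCII strings keep working.

```elisp
(defvar demo-check-coding-strictly nil
  "Hypothetical debugging switch; when non-nil, reject suspicious input.")

(defun demo-strict-encode-coding-string (string coding-system)
  "Like `encode-coding-string', optionally rejecting non-ASCII unibyte STRING."
  (when (and demo-check-coding-strictly
             (not (multibyte-string-p string))
             ;; A unibyte string with bytes above 127 is probably already
             ;; encoded, so encoding it again is likely a bug.
             (> (apply #'max 0 (append string nil)) 127))
    (error "Encoding a unibyte string with non-ASCII bytes: %S" string))
  (encode-coding-string string coding-system))

;; With the switch off (the default), nothing changes for callers.
(demo-strict-encode-coding-string "\u03b2" 'utf-8)

;; With the switch on, re-encoding already-encoded bytes signals an error.
(let ((demo-check-coding-strictly t))
  (condition-case err
      (demo-strict-encode-coding-string (unibyte-string #xCE #xB2) 'utf-8)
    (error (error-message-string err))))
```

Keeping the switch off by default matches the stated intent that end users never see these errors unless they ask for them.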
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-29 18:58 ` Stefan Monnier @ 2019-05-29 19:09 ` Eli Zaretskii 2019-05-29 19:50 ` Stefan Monnier 0 siblings, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-29 19:09 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: emacs-devel@gnu.org > Date: Wed, 29 May 2019 14:58:57 -0400 > > > > I don't think there was much to discuss about it. > > Unilaterally. > > Yes, it was a clear and simple bug-fix, AFAIK. We were in the middle of discussing what should be done, so fairness would have it that a patch should be proposed, not pushed. Or so I thought, turns out naïvely. > > I have nothing against you making these changes locally to catch > > bugs. I sometimes do similar things in my local version of the code. > > Yes, and this experience showed me that it's a good practice to follow > (I don't mean "adding checks" but "align unibyte/multibyte with > encoded/decoded"). > > BTW, I think we should add such checks in `master` (but make them > conditional on a variable like > `check-encoding/decoding-strictly-for-debugging-purposes` which would > of course default to nil). I'm okay with enabling such checks in an Emacs configured with '--enable-checking', but not in the production version. The main purpose of a released Emacs is not to expose bugs, it's to allow users to do whatever they need to do in Emacs, even if it means tolerating some bugs. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-29 19:09 ` Eli Zaretskii @ 2019-05-29 19:50 ` Stefan Monnier 0 siblings, 0 replies; 50+ messages in thread From: Stefan Monnier @ 2019-05-29 19:50 UTC (permalink / raw) To: emacs-devel > I'm okay with enabling such checks in an Emacs configured with > '--enable-checking', but not in the production version. The main > purpose of a released Emacs is not to expose bugs, it's to allow users > to do whatever they need to do in Emacs, even if it means tolerating some > bugs. Yes, as mentioned the default would be nil: there's no point signaling such errors to an end-user that hasn't specifically asked for such sanity checks. I hadn't (and still haven't) considered enabling those checks when --enable-checking is specified: it might turn up too many false positives currently. But I'll keep it in mind. Stefan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-27 13:02 ` Robert Pluim 2019-05-27 13:32 ` Stefan Monnier @ 2019-05-27 16:43 ` Eli Zaretskii 1 sibling, 0 replies; 50+ messages in thread From: Eli Zaretskii @ 2019-05-27 16:43 UTC (permalink / raw) To: Robert Pluim; +Cc: emacs-devel > From: Robert Pluim <rpluim@gmail.com> > Date: Mon, 27 May 2019 15:02:42 +0200 > Cc: emacs-devel@gnu.org > > If I take a string of say "β", and replace string-as-unibyte with > (encode-coding-string 'emacs-internal), `encoded-string-description' > prints "#xCE #xB2", which is the correct UTF-8 encoded > value. emacs-internal produces the internal representation, which is a superset of UTF-8. It doesn't produce a 100% valid UTF-8. That's an important distinction that should be always kept in mind. ^ permalink raw reply [flat|nested] 50+ messages in thread
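The distinction drawn here can be sketched as follows, assuming `emacs-internal` is available as a coding system in the running Emacs. The helper `demo-internal-bytes` is a hypothetical name used only for this illustration.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-internal-bytes (string)
  "Return the bytes of STRING in Emacs's internal representation, as a list."
  (append (encode-coding-string string 'emacs-internal) nil))

;; For a Unicode character the result coincides with UTF-8:
;; β (U+03B2) gives the bytes #xCE #xB2, as Robert observed.
(demo-internal-bytes "\u03b2")

;; Emacs characters can lie beyond Unicode's upper limit of #x10FFFF.
;; The bytes produced for such a character follow the same scheme but are
;; not valid UTF-8, since UTF-8 proper stops at #x10FFFF.
(demo-internal-bytes (string #x110000))
```

The second call is the case where "a superset of UTF-8" matters: a strict UTF-8 consumer should not be expected to accept that output.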
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-27 12:24 ` Stefan Monnier 2019-05-27 13:02 ` Robert Pluim @ 2019-05-27 16:42 ` Eli Zaretskii 2019-05-27 19:13 ` Stefan Monnier 1 sibling, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-27 16:42 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Mon, 27 May 2019 08:24:46 -0400 > > > (string-as-unibyte STRING) > > > > This function is obsolete since 26.1; > > use `encode-coding-string'. > > Probably introduced at or before Emacs version 20.3. > > This function does not change global state, including the match data. > > > > Having trawled through the elisp manual, for the life of me itʼs not > > clear which coding system I should use. 'raw-text'? 'us-ascii'? > > Something Else? > > The coding that most closely corresponds to what string-as-unibyte does > is `emacs-internal`. Why "most closely"? Maybe I'm missing something, but I think it corresponds exactly. (I actually prefer using utf-8-emacs-unix, since emacs-internal is an alias, and deceptively doesn't mention its EOL conversion.) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-27 16:42 ` Eli Zaretskii @ 2019-05-27 19:13 ` Stefan Monnier 0 siblings, 0 replies; 50+ messages in thread From: Stefan Monnier @ 2019-05-27 19:13 UTC (permalink / raw) To: emacs-devel > Why "most closely"? Maybe I'm missing something, but I think it > corresponds exactly. I'm never sure if it corresponds 100%. > (I actually prefer using utf-8-emacs-unix, since emacs-internal is an > alias, and deceptively doesn't mention its EOL conversion.) `emacs-internal` works both in Emacs-22 and Emacs-25 (and presumably in Emacs-36 where we'll be using the new quantum-character encoding). IOW it says precisely what we want: the internal encoding used in the current version of Emacs. Whereas utf-8-emacs-unix reads to me as "some Emacs-specific encoding related to utf-8", so the intention is less clear. But either way works, it's a matter of taste, Stefan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-27 9:47 ` Robert Pluim 2019-05-27 12:24 ` Stefan Monnier @ 2019-05-27 16:40 ` Eli Zaretskii 2019-05-27 20:17 ` Richard Stallman 1 sibling, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-27 16:40 UTC (permalink / raw) To: Robert Pluim; +Cc: emacs-devel > From: Robert Pluim <rpluim@gmail.com> > Cc: Eli Zaretskii <eliz@gnu.org> > Date: Mon, 27 May 2019 11:47:35 +0200 > > string-as-unibyte is a built-in function in `src/fns.c'. > > (string-as-unibyte STRING) > > This function is obsolete since 26.1; > use `encode-coding-string'. > Probably introduced at or before Emacs version 20.3. > This function does not change global state, including the match data. > > Having trawled through the elisp manual, for the life of me itʼs not > clear which coding system I should use. 'raw-text'? 'us-ascii'? > Something Else? You should use utf-8-emacs-unix. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-27 16:40 ` Eli Zaretskii @ 2019-05-27 20:17 ` Richard Stallman 2019-05-28 2:36 ` Eli Zaretskii 0 siblings, 1 reply; 50+ messages in thread From: Richard Stallman @ 2019-05-27 20:17 UTC (permalink / raw) To: Eli Zaretskii; +Cc: rpluim, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Having trawled through the elisp manual, for the life of me itʼs not > > clear which coding system I should use. 'raw-text'? 'us-ascii'? > > Something Else? > You should use utf-8-emacs-unix. Have we got clear documentation of how raw-text and utf-8-emacs-unix differ and when to use each one? Does the Emacs Lisp manual refer to that documentation from the places it would be useful to do so? -- Dr Richard Stallman President, Free Software Foundation (https://gnu.org, https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-27 20:17 ` Richard Stallman @ 2019-05-28 2:36 ` Eli Zaretskii 2019-05-28 7:06 ` Robert Pluim 0 siblings, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-28 2:36 UTC (permalink / raw) To: rms; +Cc: rpluim, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: rpluim@gmail.com, emacs-devel@gnu.org > Date: Mon, 27 May 2019 16:17:31 -0400 > > > > Having trawled through the elisp manual, for the life of me itʼs not > > > clear which coding system I should use. 'raw-text'? 'us-ascii'? > > > Something Else? > > > You should use utf-8-emacs-unix. > > Have we got clear documentation of how raw-text and utf-8-emacs-unix differ > and when to use each one? > > Does the Emacs Lisp manual refer to that documentation from the places > it would be useful to do so? We have these two documented. I'm not the right person to say whether it's clear enough, people should feel free to complain about unclear aspects and missing references. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 2:36 ` Eli Zaretskii @ 2019-05-28 7:06 ` Robert Pluim 2019-05-28 14:59 ` Eli Zaretskii 0 siblings, 1 reply; 50+ messages in thread From: Robert Pluim @ 2019-05-28 7:06 UTC (permalink / raw) To: Eli Zaretskii; +Cc: rms, emacs-devel >>>>> On Tue, 28 May 2019 05:36:42 +0300, Eli Zaretskii <eliz@gnu.org> said: >> From: Richard Stallman <rms@gnu.org> >> Cc: rpluim@gmail.com, emacs-devel@gnu.org >> Date: Mon, 27 May 2019 16:17:31 -0400 >> >> > > Having trawled through the elisp manual, for the life of me itʼs not >> > > clear which coding system I should use. 'raw-text'? 'us-ascii'? >> > > Something Else? >> >> > You should use utf-8-emacs-unix. >> >> Have we got clear documentation of how raw-text and utf-8-emacs-unix differ >> and when to use each one? >> >> Does the Emacs Lisp manual refer to that documentation from the places >> it would be useful to do so? Eli> We have these two documented. I'm not the right person to say whether Eli> it's clear enough, people should feel free to complain about unclear Eli> aspects and missing references. Itʼs clear enough to me now. The only thing that trips me up is this kind of phrase (from (info "(elisp)Coding System Basics") When you use ‘raw-text’ to encode multibyte text, it does perform one character code conversion: it converts eight-bit characters to their single-byte external representation. because I always forget that eight-bit characters have a multi-byte representation in the emacs-internal coding system. Robert ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 7:06 ` Robert Pluim @ 2019-05-28 14:59 ` Eli Zaretskii 2019-05-28 15:11 ` Robert Pluim 0 siblings, 1 reply; 50+ messages in thread From: Eli Zaretskii @ 2019-05-28 14:59 UTC (permalink / raw) To: Robert Pluim; +Cc: emacs-devel > From: Robert Pluim <rpluim@gmail.com> > Cc: rms@gnu.org, emacs-devel@gnu.org > Date: Tue, 28 May 2019 09:06:37 +0200 > > Itʼs clear enough to me now. The only thing that trips me up is this > kind of phrase (from (info "(elisp)Coding System Basics") > > When you use ‘raw-text’ to encode multibyte text, it does perform > one character code conversion: it converts eight-bit characters to > their single-byte external representation. > > because I always forget that eight-bit characters have a multi-byte > representation in the emacs-internal coding system. So why does that trip you? It reminds you what you tend to forget, so it's a good documentation, right? ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer 2019-05-28 14:59 ` Eli Zaretskii @ 2019-05-28 15:11 ` Robert Pluim 0 siblings, 0 replies; 50+ messages in thread From: Robert Pluim @ 2019-05-28 15:11 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel >>>>> On Tue, 28 May 2019 17:59:39 +0300, Eli Zaretskii <eliz@gnu.org> said: >> From: Robert Pluim <rpluim@gmail.com> >> Cc: rms@gnu.org, emacs-devel@gnu.org >> Date: Tue, 28 May 2019 09:06:37 +0200 >> >> Itʼs clear enough to me now. The only thing that trips me up is this >> kind of phrase (from (info "(elisp)Coding System Basics") >> >> When you use ‘raw-text’ to encode multibyte text, it does perform >> one character code conversion: it converts eight-bit characters to >> their single-byte external representation. >> >> because I always forget that eight-bit characters have a multi-byte >> representation in the emacs-internal coding system. Eli> So why does that trip you? It reminds you what you tend to forget, so Eli> it's a good documentation, right? Itʼs just a moment of cognitive dissonance: "why does it convert eight-bit characters? They're one byte in size! Oh actually they're not". So yes, in that sense itʼs good, it prevents the propagation of my misconception. Robert ^ permalink raw reply [flat|nested] 50+ messages in thread
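The point that trips Robert up can be checked directly. A minimal sketch, assuming BYTE is in the range 128..255; the helper `demo-eight-bit-char-sizes` is a hypothetical name for this illustration.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-eight-bit-char-sizes (byte)
  "Describe the eight-bit raw-byte character for BYTE (128..255)."
  (let ((s (string-to-multibyte (unibyte-string byte))))
    (list (length s)        ; number of characters in the multibyte string
          (string-bytes s)  ; bytes used by the internal representation
          ;; raw-text converts the character back to its single byte.
          (append (encode-coding-string s 'raw-text) nil))))

(demo-eight-bit-char-sizes #xC0)   ; => (1 2 (192))
```

One character occupies two bytes internally, and encoding with `raw-text` brings it back to the single byte #xC0, which is the conversion the quoted manual passage describes.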