* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
From: Reuben Thomas @ 2017-08-23 10:59 UTC (permalink / raw)
To: Eli Zaretskii, 28179
On 22/08/17 18:23, Eli Zaretskii wrote:
>> Cc: 28179@debbugs.gnu.org
>> From: Reuben Thomas <rrt@sc3d.org>
>> Date: Tue, 22 Aug 2017 18:04:11 +0100
>>
>> Are you sure we don't need to ensure ispell-get-decoded-string always
>> returns a multibyte string? What if decode-coding-string returns a
>> pure ASCII string, which is therefore unibyte?
>>
>> This is multibyte too, no? The Emacs manual says:
>>
>> Rather, Emacs uses a variable-length internal representation of
>> characters, that stores each character as a sequence of 1 to 5 8-bit
>> bytes, depending on the magnitude of its codepoint(1). For example, any
>> ASCII character takes up only 1 byte, a Latin-1 character takes up 2
>> bytes, etc. We call this representation of text “multibyte”.
> This is a misunderstanding, caused by the overloaded meaning of
> "multibyte string". The way I meant it, it has to do with the
> internal flag marking a string either unibyte or multibyte. Observe:
>
> (multibyte-string-p "abcd") => nil
>
> but
>
> (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
So here, running decode-coding-string on a plain ASCII string returns a
multibyte string.
> ispell-decode-string, which you replaced with its body. The call to
> string-to-multibyte worked on the result of decoding, not instead of
> the decoding. So actually the call to string-to-multibyte was not
> replaced, it was removed.
Yes, that call seemed to be unnecessary.
> Is the issue more clear now?
I now understand the two meanings of "multibyte", but I don't understand
how my patch is deficient. I even tried:
(multibyte-string-p (decode-coding-string "abcde" 'utf-8 t))
;; returns t; the same happens if I use 'us-ascii
So in fact even when the string isn't copied (as in my patch, where I
also use a third argument of t to decode-coding-string) it appears to be
changed to a multibyte string.
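For anyone wanting to repeat this check across more than one coding
system, here is a minimal sketch. The helper name
my-decoded-multibyte-flags and the list of coding systems are
assumptions made for illustration; neither comes from the patch. It
only reports what multibyte-string-p says for a pure-ASCII input, with
and without the NOCOPY argument, and the results may differ between
Emacs versions.

```elisp
;; Sketch only: `my-decoded-multibyte-flags' is a hypothetical helper,
;; not part of ispell.el or of the patch under discussion.
(defun my-decoded-multibyte-flags (string coding-systems)
  "Return a list of (CODING-SYSTEM COPIED-FLAG NOCOPY-FLAG) for STRING.
COPIED-FLAG is the multibyte flag of the decoded string when NOCOPY is
nil; NOCOPY-FLAG is the flag when NOCOPY is t."
  (mapcar
   (lambda (cs)
     (list cs
           (multibyte-string-p (decode-coding-string string cs))
           (multibyte-string-p (decode-coding-string string cs t))))
   coding-systems))

;; The coding systems listed here are an arbitrary sample.
(my-decoded-multibyte-flags "abcde" '(utf-8 us-ascii iso-2022-jp latin-1))
```

Evaluating the last form in *scratch* gives one row per coding system,
so any coding system that leaves the string unibyte would stand out.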
--
https://rrt.sc3d.org
* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
From: Eli Zaretskii @ 2017-08-24 16:59 UTC (permalink / raw)
To: Reuben Thomas; +Cc: 28179
> From: Reuben Thomas <rrt@sc3d.org>
> Date: Wed, 23 Aug 2017 11:59:41 +0100
>
> I now understand the two meanings of "multibyte", but I don't understand
> how my patch is deficient.
I didn't say it was deficient, I asked whether you verified that
either (a) the result is always multibyte, or (b) that we don't need
to worry about it being multibyte if it is pure-ASCII.
> So in fact even when the string isn't copied (as in my patch, where I
> also use a third argument of t to decode-coding-string) it appears to be
> changed to a multibyte string.
Fine, if you are sure, go ahead and push.
Thanks.
* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
From: Noam Postavsky @ 2017-08-24 17:32 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 28179, Reuben Thomas
On Thu, Aug 24, 2017 at 12:59 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: Reuben Thomas <rrt@sc3d.org>
>> So in fact even when the string isn't copied (as in my patch, where I
>> also use a third argument of t to decode-coding-string) it appears to be
>> changed to a multibyte string.
>
> Fine, if you are sure, go ahead and push.
But please, think of the children^H^H^H^H^H^H^H^H readers (of your
patch)! Put this information in the commit message.
* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
From: Reuben Thomas @ 2017-08-24 17:45 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 28179
On 24/08/17 17:59, Eli Zaretskii wrote:
>> From: Reuben Thomas <rrt@sc3d.org>
>> Date: Wed, 23 Aug 2017 11:59:41 +0100
>>
>> I now understand the two meanings of "multibyte", but I don't understand
>> how my patch is deficient.
> I didn't say it was deficient,
Sorry, I was unclear. I meant, precisely, I don't see why you think my
patch's code returns a string that is not multibyte.
> I asked whether you verified that
> either (a) the result is always multibyte
I believe I showed this is the case.
>
>> So in fact even when the string isn't copied (as in my patch, where I
>> also use a third argument of t to decode-coding-string) it appears to be
>> changed to a multibyte string.
> Fine, if you are sure, go ahead and push.
>
The reason I am asking again is that you first said:
> What if decode-coding-string returns a pure ASCII string, which is
> therefore unibyte?
and then later you said:
> The way I meant it, it has to do with the internal flag marking a
> string either unibyte or multibyte. Observe:
> (multibyte-string-p "abcd") => nil
>
> but
>
> (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
In other words:
1. As far as I can tell from the above (and my own confirmatory
experiments and reading of the documentation), a pure ASCII string can
be multibyte (it's a matter of the multibyte flag, not the number of
bytes used to store each character).
2. decode-coding-string always returns a multibyte string.
Since these two observations seemed to mean that you contradicted
yourself, I was checking whether in fact I had misunderstood (so that
for example one of my two observations above is wrong), or if your
original understanding was incomplete (so that in fact your question
about decode-coding-string is therefore misguided, because it can return
a pure ASCII unibyte string (in the coding sense) which is nonetheless a
multibyte string (in the sense that multibyte-string-p on it returns t)).
Sorry about the miscommunication. In any case, I think the code is
correct, your original question was misguided, and I shall push, with,
as Noam requested in another message, an explanation of my assumptions.
No need to reply further unless you think there really is a problem!
--
https://rrt.sc3d.org
* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
From: Eli Zaretskii @ 2017-08-24 18:20 UTC (permalink / raw)
To: Reuben Thomas; +Cc: 28179
> Cc: 28179@debbugs.gnu.org
> From: Reuben Thomas <rrt@sc3d.org>
> Date: Thu, 24 Aug 2017 18:45:33 +0100
>
> The reason I am asking again is because you first said:
>
> > What if decode-coding-string returns a pure ASCII string, which is
> > therefore unibyte?
>
> and then later you said:
>
> > The way I meant it, it has to do with the internal flag marking a
> > string either unibyte or multibyte. Observe:
> > (multibyte-string-p "abcd") => nil
> >
> > but
> >
> > (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
That example may be conclusive for UTF-8, but is it conclusive for
_any_ encoding? I don't know. E.g., what about the ISO-2022 based
encodings, where all the bytes are (AFAIR) pure ASCII?
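One possible reading of this question, offered as a sketch rather than
a statement of what was meant: in ISO-2022-JP the encoded bytes are all
7-bit, yet escape sequences switch character sets, so an all-ASCII byte
string need not decode to ASCII text. The sample byte string below is
an assumed example and is not taken from ispell.el.

```elisp
;; Assumed sample input: ESC $ B switches to JIS X 0208 and ESC ( B
;; switches back to ASCII.  Every byte in the literal is below 128.
(let* ((bytes "\e$B$3$s\e(B")
       (decoded (decode-coding-string bytes 'iso-2022-jp)))
  ;; Report the flags rather than assuming them.
  (list :input-multibyte (multibyte-string-p bytes)
        :output-multibyte (multibyte-string-p decoded)
        :output-length (length decoded)))
```

Comparing this with the "abcd" case shows whether the multibyte flag of
the result depends on what the bytes decode to, or only on the fact
that decoding took place.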
> 1. As far as I can tell from the above (and my own confirmatory
> experiments and reading of the documentation), a pure ASCII string can
> be multibyte (it's a matter of the multibyte flag, not the number of
> bytes used to store each character).
>
> 2. decode-coding-string always returns a multibyte string.
Can you show me why 2 is always correct? It might be, I simply don't
know. All I know is that in general relying on plain-ASCII strings to
be always multibyte in any given situation is risky; we were bitten by
that a few times. But maybe it's not an issue in this case. Which is
why I was asking you whether you have sufficient basis to believe this
to be so in this case.
> Since these two observations seemed to mean that you contradicted
> yourself, I was checking whether in fact I had misunderstood (so that
> for example one of my two observations above is wrong), or if your
> original understanding was incomplete (so that in fact your question
> about decode-coding-string is therefore misguided, because it can return
> a pure ASCII unibyte string (in the coding sense) which is nonetheless a
> multibyte string (in the sense that multibyte-string-p on it returns t)).
I only used decode-coding-string because I remembered it as an easy
way of creating a multibyte ASCII string, when the coding-system is
UTF-8, that's all. There was no contradiction in what I said, at
least not an intended one.
* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
From: Reuben Thomas @ 2017-08-24 18:50 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 28179
On 24 August 2017 at 19:20, Eli Zaretskii <eliz@gnu.org> wrote:
>> Cc: 28179@debbugs.gnu.org
>> From: Reuben Thomas <rrt@sc3d.org>
>> Date: Thu, 24 Aug 2017 18:45:33 +0100
>>
>> The reason I am asking again is because you first said:
>>
>> > What if decode-coding-string returns a pure ASCII string, which is
>> > therefore unibyte?
>>
>> and then later you said:
>>
>> > The way I meant it, it has to do with the internal flag marking a
>> > string either unibyte or multibyte. Observe:
>> > (multibyte-string-p "abcd") => nil
>> >
>> > but
>> >
>> > (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
>
> That example may be conclusive for UTF-8, but is it conclusive for
> _any_ encoding? I don't know. E.g., what about the ISO-2022 based
> encodings, where all the bytes are (AFAIR) pure ASCII?
(multibyte-string-p (decode-coding-string "abcd" 'iso-2022-jp)) => t
I still don't understand what you're getting at: the bytes in "abcd"
are pure ASCII, whatever coding system one is decoding from.
> Can you show me why 2 is always correct? It might be, I simply don't
> know. All I know is that in general relying on plain-ASCII strings to
> be always multibyte in any given situation is risky; we were bitten by
> that a few times. But maybe it's not an issue in this case. Which is
> why I was asking you whether you have sufficient basis to believe this
> to be so in this case.
I don't know.
As I said before, the make-obsolete notice for string-to-multibyte
says "use `decode-coding-string'". If it is as tricky as you suggest
it might be, then the notice should be updated to point to more
detailed guidance.
The relevant commit is:
commit f74d496478cd57f252817bd7437fe1b7972ce01f
Author: Stefan Monnier <monnier@iro.umontreal.ca>
Date: Mon Jan 30 13:02:18 2017 -0500
* lisp/subr.el (string-make-unibyte, string-make-multibyte): Obsolete.
diff --git a/lisp/subr.el b/lisp/subr.el
index a6ba05c..a204577 100644
--- a/lisp/subr.el
+++ b/lisp/subr.el
@@ -1417,8 +1417,10 @@ posn-object-width-height
;; bug#23850
(make-obsolete 'string-to-unibyte "use `encode-coding-string'." "26.1")
(make-obsolete 'string-as-unibyte "use `encode-coding-string'." "26.1")
+(make-obsolete 'string-make-unibyte "use `encode-coding-string'." "26.1")
(make-obsolete 'string-to-multibyte "use `decode-coding-string'." "26.1")
(make-obsolete 'string-as-multibyte "use `decode-coding-string'." "26.1")
+(make-obsolete 'string-make-multibyte "use `decode-coding-string'." "26.1")
I'm going to close this bug; if better documentation is needed, both
for the obsolescence of string-to-multibyte and for multibyte strings
in general, that's a new bug.
--
https://rrt.sc3d.org
* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
From: Eli Zaretskii @ 2017-08-24 19:02 UTC (permalink / raw)
To: Reuben Thomas; +Cc: 28179
> From: Reuben Thomas <rrt@sc3d.org>
> Date: Thu, 24 Aug 2017 19:50:17 +0100
> Cc: 28179@debbugs.gnu.org
>
> >> > (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
> >
> > That example may be conclusive for UTF-8, but is it conclusive for
> > _any_ encoding? I don't know. E.g., what about the ISO-2022 based
> > encodings, where all the bytes are (AFAIR) pure ASCII?
>
> (multibyte-string-p (decode-coding-string "abcd" 'iso-2022-jp)) => t
That's not what I meant, but never mind. I only replied to say that there
was no contradiction in my previous messages, and no confusion on my
part, that's all.
Thanks.