bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
       [not found] <f64ac7bd-9952-c09a-71df-f1e123407cff@sc3d.org>
@ 2017-08-23 10:59 ` Reuben Thomas
  2017-08-24 16:59   ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Reuben Thomas @ 2017-08-23 10:59 UTC (permalink / raw)
  To: Eli Zaretskii, 28179

On 22/08/17 18:23, Eli Zaretskii wrote:

>> Cc: 28179@debbugs.gnu.org
>> From: Reuben Thomas <rrt@sc3d.org>
>> Date: Tue, 22 Aug 2017 18:04:11 +0100
>>
>> Are you sure we don't need to ensure ispell-get-decoded-string always
>> returns a multibyte string?  What if decode-coding-string returns a
>> pure ASCII string, which is therefore unibyte?
>>
>> This is multibyte too, no? The Emacs manual says:
>>
>>  Rather, Emacs uses a variable-length internal representation of
>>  characters, that stores each character as a sequence of 1 to 5 8-bit
>>  bytes, depending on the magnitude of its codepoint(1). For example, any
>>  ASCII character takes up only 1 byte, a Latin-1 character takes up 2
>>  bytes, etc. We call this representation of text “multibyte”.
> This is a misunderstanding, caused by the overloaded meaning of
> "multibyte string".  The way I meant it, it has to do with the
> internal flag marking a string either unibyte or multibyte.  Observe:
>
>   (multibyte-string-p "abcd") => nil
>
> but
>
>   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t

So here, running decode-coding-string on a plain ASCII string returns a
multibyte string.

> ispell-decode-string, which you replaced with its body. The call to
> string-to-multibyte worked on the result of decoding, not instead of
> the decoding.  So actually the call to string-to-multibyte was not
> replaced, it was removed.

Yes, that call seemed to be unnecessary.

> Is the issue more clear now? 

I now understand the two meanings of "multibyte", but I don't understand
how my patch is deficient. I even tried:

(multibyte-string-p (decode-coding-string "abcde" 'utf-8 t)) ; => t, also with 'us-ascii

So in fact even when the string isn't copied (as in my patch, where I
also use a third argument of t to decode-coding-string) it appears to be
changed to a multibyte string.
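
To illustrate the distinction discussed above (the multibyte flag versus the bytes actually stored), here is a minimal sketch. The helper name `my-multibyte-flag-demo` is hypothetical and does not appear in ispell.el or in the patch; it uses `string-to-multibyte` only as a convenient way to obtain a flagged copy of an ASCII string.

```elisp
;; Hypothetical helper, named here only for illustration.
(defun my-multibyte-flag-demo ()
  "Return a list describing a unibyte and a multibyte copy of \"abcd\"."
  (let* ((uni "abcd")                       ; pure-ASCII literal: unibyte
         (multi (string-to-multibyte uni))) ; same characters, flag set
    (list (multibyte-string-p uni)          ; flag of the literal
          (multibyte-string-p multi)        ; flag of the converted copy
          (string= uni multi)               ; same characters?
          (= (string-bytes uni)             ; same storage size?
             (string-bytes multi)))))

(my-multibyte-flag-demo) ; => (nil t t t)
```

The two strings compare equal and occupy the same number of bytes; only the internal flag differs, which is the second meaning of "multibyte" in this thread.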

-- 
https://rrt.sc3d.org







* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
  2017-08-23 10:59 ` bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el Reuben Thomas
@ 2017-08-24 16:59   ` Eli Zaretskii
  2017-08-24 17:32     ` Noam Postavsky
  2017-08-24 17:45     ` Reuben Thomas
  0 siblings, 2 replies; 7+ messages in thread
From: Eli Zaretskii @ 2017-08-24 16:59 UTC (permalink / raw)
  To: Reuben Thomas; +Cc: 28179

> From: Reuben Thomas <rrt@sc3d.org>
> Date: Wed, 23 Aug 2017 11:59:41 +0100
> 
> I now understand the two meanings of "multibyte", but I don't understand
> how my patch is deficient.

I didn't say it was deficient, I asked whether you verified that
either (a) the result is always multibyte, or (b) that we don't need
to worry about it being multibyte if it is pure-ASCII.

> So in fact even when the string isn't copied (as in my patch, where I
> also use a third argument of t to decode-coding-string) it appears to be
> changed to a multibyte string.

Fine, if you are sure, go ahead and push.

Thanks.






* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
  2017-08-24 16:59   ` Eli Zaretskii
@ 2017-08-24 17:32     ` Noam Postavsky
  2017-08-24 17:45     ` Reuben Thomas
  1 sibling, 0 replies; 7+ messages in thread
From: Noam Postavsky @ 2017-08-24 17:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 28179, Reuben Thomas

On Thu, Aug 24, 2017 at 12:59 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: Reuben Thomas <rrt@sc3d.org>
>> So in fact even when the string isn't copied (as in my patch, where I
>> also use a third argument of t to decode-coding-string) it appears to be
>> changed to a multibyte string.
>
> Fine, if you are sure, go ahead and push.

But please, think of the children^H^H^H^H^H^H^H^H readers (of your
patch)! Put this information in the commit message.






* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
  2017-08-24 16:59   ` Eli Zaretskii
  2017-08-24 17:32     ` Noam Postavsky
@ 2017-08-24 17:45     ` Reuben Thomas
  2017-08-24 18:20       ` Eli Zaretskii
  1 sibling, 1 reply; 7+ messages in thread
From: Reuben Thomas @ 2017-08-24 17:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 28179

On 24/08/17 17:59, Eli Zaretskii wrote:
>> From: Reuben Thomas <rrt@sc3d.org>
>> Date: Wed, 23 Aug 2017 11:59:41 +0100
>>
>> I now understand the two meanings of "multibyte", but I don't understand
>> how my patch is deficient.
> I didn't say it was deficient,

Sorry, I was unclear. I meant, precisely, I don't see why you think my
patch's code returns a string that is not multibyte.

>  I asked whether you verified that
> either (a) the result is always multibyte

I believe I showed this is the case.

>
>> So in fact even when the string isn't copied (as in my patch, where I
>> also use a third argument of t to decode-coding-string) it appears to be
>> changed to a multibyte string.
> Fine, if you are sure, go ahead and push.
>

The reason I am asking again is because you first said:

> What if decode-coding-string returns a pure ASCII string, which is
> therefore unibyte?

and then later you said:

> The way I meant it, it has to do with the internal flag marking a
> string either unibyte or multibyte. Observe:
>   (multibyte-string-p "abcd") => nil
>
> but
>
>   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t

In other words:

1. As far as I can tell from the above (and my own confirmatory
experiments and reading of the documentation), a pure ASCII string can
be multibyte (it's a matter of the multibyte flag, not the number of
bytes used to store each character).

2. decode-coding-string always returns a multibyte string.

Since these two observations seemed to mean that you contradicted
yourself, I was checking whether in fact I had misunderstood (so that
for example one of my two observations above is wrong), or if your
original understanding was incomplete (so that in fact your question
about decode-coding-string is therefore misguided, because it can return
a pure ASCII unibyte string (in the coding sense) which is nonetheless a
multibyte string (in the sense that multibyte-string-p on it returns t)).

Sorry about the miscommunication. In any case, I think the code is
correct, your original question was misguided, and I shall push, with,
as Noam requested in another message, an explanation of my assumptions.
No need to reply further unless you think there really is a problem!

-- 

https://rrt.sc3d.org


* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
  2017-08-24 17:45     ` Reuben Thomas
@ 2017-08-24 18:20       ` Eli Zaretskii
  2017-08-24 18:50         ` Reuben Thomas
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2017-08-24 18:20 UTC (permalink / raw)
  To: Reuben Thomas; +Cc: 28179

> Cc: 28179@debbugs.gnu.org
> From: Reuben Thomas <rrt@sc3d.org>
> Date: Thu, 24 Aug 2017 18:45:33 +0100
> 
> The reason I am asking again is because you first said:
> 
> > What if decode-coding-string returns a pure ASCII string, which is
> > therefore unibyte?
> 
> and then later you said:
> 
> > The way I meant it, it has to do with the internal flag marking a
> > string either unibyte or multibyte. Observe:
> >   (multibyte-string-p "abcd") => nil
> >
> > but
> >
> >   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t

That example may be conclusive for UTF-8, but is it conclusive for
_any_ encoding?  I don't know.  E.g., what about the ISO-2022 based
encodings, where all the bytes are (AFAIR) pure ASCII?

> 1. As far as I can tell from the above (and my own confirmatory
> experiments and reading of the documentation), a pure ASCII string can
> be multibyte (it's a matter of the multibyte flag, not the number of
> bytes used to store each character).
> 
> 2. decode-coding-string always returns a multibyte string.

Can you show me why 2 is always correct?  It might be, I simply don't
know.  All I know is that in general relying on plain-ASCII strings to
be always multibyte in any given situation is risky, we were bitten by
that a few times.  But maybe it's not an issue in this case.  Which is
why I was asking you whether you have sufficient basis to believe this
to be so in this case.

> Since these two observations seemed to mean that you contradicted
> yourself, I was checking whether in fact I had misunderstood (so that
> for example one of my two observations above is wrong), or if your
> original understanding was incomplete (so that in fact your question
> about decode-coding-string is therefore misguided, because it can return
> a pure ASCII unibyte string (in the coding sense) which is nonetheless a
> multibyte string (in the sense that multibyte-string-p on it returns t)).

I only used decode-coding-string because I remembered it as an easy
way of creating a multibyte ASCII string, when the coding-system is
UTF-8, that's all.  There was no contradiction in what I said, at
least not an intended one.
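
One way to probe the open question above (whether the result is flagged multibyte for every coding system, not just UTF-8) is to check several coding systems and, where the flag matters, to guard the result explicitly. The following is only a sketch: both helper names and the choice of coding systems are assumptions made for illustration, and no particular output is claimed, since that is exactly what is in doubt.

```elisp
;; Hypothetical helpers; the names and the list of coding systems are
;; illustrative assumptions, not taken from ispell.el.
(defun my-decoded-multibyte-report (string coding-systems)
  "Decode STRING with each of CODING-SYSTEMS.
Return an alist of (CODING-SYSTEM . MULTIBYTE-P) pairs."
  (mapcar (lambda (cs)
            (cons cs (multibyte-string-p
                      (decode-coding-string string cs))))
          coding-systems))

(defun my-ensure-multibyte (string)
  "Return STRING if its multibyte flag is set, else a multibyte copy."
  (if (multibyte-string-p string)
      string
    (string-to-multibyte string)))

;; Inspect the flag for a pure-ASCII input under several coding systems.
(my-decoded-multibyte-report "abcd"
                             '(utf-8 us-ascii iso-latin-1 iso-2022-jp))
```

A report like this only samples a few coding systems on one Emacs build, so it cannot prove the general claim; a guard such as `my-ensure-multibyte` avoids depending on it at all.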


* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
  2017-08-24 18:20       ` Eli Zaretskii
@ 2017-08-24 18:50         ` Reuben Thomas
  2017-08-24 19:02           ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Reuben Thomas @ 2017-08-24 18:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 28179

On 24 August 2017 at 19:20, Eli Zaretskii <eliz@gnu.org> wrote:
>> Cc: 28179@debbugs.gnu.org
>> From: Reuben Thomas <rrt@sc3d.org>
>> Date: Thu, 24 Aug 2017 18:45:33 +0100
>>
>> The reason I am asking again is because you first said:
>>
>> > What if decode-coding-string returns a pure ASCII string, which is
>> > therefore unibyte?
>>
>> and then later you said:
>>
>> > The way I meant it, it has to do with the internal flag marking a
>> > string either unibyte or multibyte. Observe:
>> >   (multibyte-string-p "abcd") => nil
>> >
>> > but
>> >
>> >   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
>
> That example may be conclusive for UTF-8, but is it conclusive for
> _any_ encoding?  I don't know.  E.g., what about the ISO-2022 based
> encodings, where all the bytes are (AFAIR) pure ASCII?

(multibyte-string-p (decode-coding-string "abcd" 'iso-2022-jp)) => t

I still don't understand what you're getting at: the bytes in "abcd"
are pure ASCII, whatever coding system one is decoding from.

> Can you show me why 2 is always correct?  It might be, I simply don't
> know.  All I know is that in general relying on plain-ASCII strings to
> be always multibyte in any given situation is risky, we were bitten by
> that a few times.  But maybe it's not an issue in this case.  Which is
> why I was asking you whether you have sufficient basis to believe this
> to be so in this case.

I don't know.

As I said before, the make-obsolete notice for string-to-multibyte
says "use `decode-coding-string'". If it is as tricky as you suggest
it might be, then the notice should be updated to point to more
detailed guidance.

The relevant commit is:

commit f74d496478cd57f252817bd7437fe1b7972ce01f
Author: Stefan Monnier <monnier@iro.umontreal.ca>
Date:   Mon Jan 30 13:02:18 2017 -0500

    * lisp/subr.el (string-make-unibyte, string-make-multibyte): Obsolete.

diff --git a/lisp/subr.el b/lisp/subr.el
index a6ba05c..a204577 100644
--- a/lisp/subr.el
+++ b/lisp/subr.el
@@ -1417,8 +1417,10 @@ posn-object-width-height
 ;; bug#23850
 (make-obsolete 'string-to-unibyte   "use `encode-coding-string'." "26.1")
 (make-obsolete 'string-as-unibyte   "use `encode-coding-string'." "26.1")
+(make-obsolete 'string-make-unibyte   "use `encode-coding-string'." "26.1")
 (make-obsolete 'string-to-multibyte "use `decode-coding-string'." "26.1")
 (make-obsolete 'string-as-multibyte "use `decode-coding-string'." "26.1")
+(make-obsolete 'string-make-multibyte "use `decode-coding-string'." "26.1")

I'm going to close this bug; if better documentation is needed, both
for the obsolescence of string-to-multibyte and for multibyte strings
in general, that's a new bug.

-- 
https://rrt.sc3d.org






* bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
  2017-08-24 18:50         ` Reuben Thomas
@ 2017-08-24 19:02           ` Eli Zaretskii
  0 siblings, 0 replies; 7+ messages in thread
From: Eli Zaretskii @ 2017-08-24 19:02 UTC (permalink / raw)
  To: Reuben Thomas; +Cc: 28179

> From: Reuben Thomas <rrt@sc3d.org>
> Date: Thu, 24 Aug 2017 19:50:17 +0100
> Cc: 28179@debbugs.gnu.org
> 
> >> >   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
> >
> > That example may be conclusive for UTF-8, but is it conclusive for
> > _any_ encoding?  I don't know.  E.g., what about the ISO-2022 based
> > encodings, where all the bytes are (AFAIR) pure ASCII?
> 
> (multibyte-string-p (decode-coding-string "abcd" 'iso-2022-jp)) => t

That's not what I meant, but never mind.  I only replied to tell there
was no contradiction in my previous messages, and no confusion on my
part, that's all.

Thanks.




