More confusion about multibyte vs unibyte strings

unofficial mirror of help-gnu-emacs@gnu.org

* More confusion about multibyte vs unibyte strings
@ 2022-05-05 16:58 Eric Abrahamsen
  2022-05-05 17:34 ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Abrahamsen @ 2022-05-05 16:58 UTC
  To: help-gnu-emacs

In gnus-search.el, we do some work on search strings before sending them
to an IMAP server as a query: there are particular formats that need to
be used depending on whether the string is plain ASCII, or needs to be
encoded as UTF-8 or something. From the code itself:

(gnus-search-imap-handle-string
 (make-instance 'gnus-search-imap :literal-plus t)
 "FROM eric")

-> "FROM eric"

(gnus-search-imap-handle-string
 (make-instance 'gnus-search-imap :literal-plus t)
 "FROM 张三")

-> "{11+}
FROM \345\274\240\344\270\211"

The function above uses `multibyte-string-p' to test whether the string
needs the extra handling. This works correctly in the minibuffer and
*scratch*:

(multibyte-string-p "FROM eric") -> nil

(multibyte-string-p "FROM 张三") -> t

but when I edebug the code during an actual IMAP search, the test
returns t for both strings, which messes things up.

I must be using it wrong! But I don't understand why. What can change in
the evaluation environment such that the calls to `multibyte-string-p'
would return different results at different times? And what check
*should* I be using to see if a string is pure ASCII?

Thanks,
Eric


* Re: More confusion about multibyte vs unibyte strings
  2022-05-05 16:58 More confusion about multibyte vs unibyte strings Eric Abrahamsen
@ 2022-05-05 17:34 ` Eli Zaretskii
  2022-05-05 18:44   ` Eric Abrahamsen
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2022-05-05 17:34 UTC
  To: help-gnu-emacs

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Thu, 05 May 2022 09:58:43 -0700
> 
> The function above uses `multibyte-string-p' to test whether the string
> needs the extra handling. This works correctly in the minibuffer and
> *scratch*:
> 
> (multibyte-string-p "FROM eric") -> nil
> 
> (multibyte-string-p "FROM 张三") -> t
> 
> but when I edebug the code during an actual IMAP search, the test
> returns t for both strings, which messes things up.

Why does it "mess things up", and what exactly is the nature of the
mess-up?  A pure-ASCII string can be either unibyte or multibyte, and
that shouldn't change a thing.

> I must be using it wrong! But I don't understand why. What can change in
> the evaluation environment such that the calls to `multibyte-string-p'
> would return different results at different times?

Any number of string operations can convert a pure-ASCII string into a
multibyte string.  The most frequent one is decode-coding-string.
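A small sketch of that point: the same ASCII text can sit in either representation and still compare equal. The variable name `my/demo-results' is hypothetical and used only for illustration.

```elisp
;; Build a unibyte and a multibyte copy of the same pure-ASCII text.
(defvar my/demo-results
  (let* ((uni (string-to-unibyte "FROM eric"))
         (multi (string-to-multibyte uni)))
    (list (multibyte-string-p uni)      ; nil
          (multibyte-string-p multi)    ; t
          (string= uni multi)))         ; t: the text itself is unchanged
  "Results of the three checks above, in order.")

;; Decoding is the conversion named in the message above.  The result
;; is expected to be multibyte even for pure-ASCII input; evaluate it
;; in your own Emacs to confirm.
(multibyte-string-p (decode-coding-string "FROM eric" 'utf-8))
```

So a string that went through decoding somewhere upstream can answer t to `multibyte-string-p' while still holding nothing but ASCII.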

Again, why should this be a problem for your code?

> And what check *should* I be using to see if a string is pure ASCII?

Why do you care?




* Re: More confusion about multibyte vs unibyte strings
  2022-05-05 17:34 ` Eli Zaretskii
@ 2022-05-05 18:44   ` Eric Abrahamsen
  2022-05-05 19:23     ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Abrahamsen @ 2022-05-05 18:44 UTC
  To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Thu, 05 May 2022 09:58:43 -0700
>> 
>> The function above uses `multibyte-string-p' to test whether the string
>> needs the extra handling. This works correctly in the minibuffer and
>> *scratch*:
>> 
>> (multibyte-string-p "FROM eric") -> nil
>> 
>> (multibyte-string-p "FROM 张三") -> t
>> 
>> but when I edebug the code during an actual IMAP search, the test
>> returns t for both strings, which messes things up.
>
> Why does it "mess things up", and what exactly is the nature of the
> mess-up?  A pure-ASCII string can be either unibyte or multibyte, and
> that shouldn't change a thing.

If the string is not ASCII, we need to encode it before sending to the
server, and tell the server what encoding we used. Microsoft Exchange
servers can't handle any encoding other than ascii. So if our code thinks
a string isn't ascii, it sends the encoding message to the IMAP server,
and Exchange blows up. If the string is ascii, we don't try to encode
it, and everything's fine. So I need to know whether the string is
actually ascii or not.

I can solve this some other way, like
(equal (length str) (string-bytes str))
but I'm just trying to figure out why this doesn't behave the way I
expect it to. I'd thought that `multibyte-string-p' essentially
performed the above length test.


* Re: More confusion about multibyte vs unibyte strings
  2022-05-05 18:44   ` Eric Abrahamsen
@ 2022-05-05 19:23     ` Eli Zaretskii
  2022-05-06  0:45       ` Eric Abrahamsen
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2022-05-05 19:23 UTC
  To: help-gnu-emacs

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Thu, 05 May 2022 11:44:41 -0700
> 
> > Why does it "mess things up", and what exactly is the nature of the
> > mess-up?  A pure-ASCII string can be either unibyte or multibyte, and
> > that shouldn't change a thing.
> 
> If the string is not ASCII, we need to encode it before sending to the
> server, and tell the server what encoding we used. Microsoft Exchange
> servers can't handle any encoding other than ascii.

What do you mean by "ascii encoding" in this context?

When you say that Microsoft Exchange can't handle any encoding other
than ascii, does it mean it cannot handle _any_ non-ASCII addressee
names?  That'd be hard to believe, because such addressee names are
nowadays in wide use.  So I guess you mean something else, but what?

> So if our code thinks a string isn't ascii, it sends the encoding
> message to the IMAP server, and Exchange blows up.

Encoding ascii yields a string that is identical to the original (IIUC
what you mean by "encoding"), so I don't follow you here.

> If the string is ascii, we don't try to encode it, and everything's
> fine. So I need to know whether the string is actually ascii or not.

You can do that using the regexp class [:ascii:], I guess.
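One way that suggestion might look in practice, as a minimal sketch. The function name `my/ascii-string-p' is hypothetical; it is not something gnus-search.el is known to define.

```elisp
(defun my/ascii-string-p (str)
  "Return non-nil if STR contains only ASCII characters.
The test looks for any character outside the [:ascii:] class, so
it does not depend on whether STR is unibyte or multibyte."
  (not (string-match-p "[^[:ascii:]]" str)))

(my/ascii-string-p "FROM eric")                       ; t
(my/ascii-string-p (string-to-multibyte "FROM eric")) ; t
(my/ascii-string-p "FROM 张三")                        ; nil
```

Unlike `multibyte-string-p', this answers the question about the content of the string rather than about its internal representation.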

> I can solve this some other way, like
> (equal (length str) (string-bytes str))

That should return non-nil for a unibyte string that includes bytes
above 127 as well, no?
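A short sketch of that objection, with the length test wrapped in a hypothetical helper `my/length-test': an already-encoded (unibyte) string passes the test even though it is clearly not ASCII.

```elisp
(defun my/length-test (str)
  "Return non-nil if STR has as many characters as bytes."
  (= (length str) (string-bytes str)))

(my/length-test "FROM eric")   ; t
(my/length-test "张三")         ; nil: 2 characters, 6 bytes

;; Encoding to UTF-8 yields a unibyte string of 6 bytes.  Each byte
;; counts as one element, so the test passes despite the bytes >127.
(my/length-test (encode-coding-string "张三" 'utf-8)) ; t
```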

> but I'm just trying to figure out why this doesn't behave the way I
> expect it to. I'd thought that `multibyte-string-p' essentially
> performed the above length test.

No, it doesn't.  A pure ASCII string can be made multibyte without
changing its payload, and Emacs usually makes unibyte strings out of
pure ASCII characters.


* Re: More confusion about multibyte vs unibyte strings
  2022-05-05 19:23     ` Eli Zaretskii
@ 2022-05-06  0:45       ` Eric Abrahamsen
  2022-05-06  2:58         ` Stefan Monnier via Users list for the GNU Emacs text editor
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Abrahamsen @ 2022-05-06  0:45 UTC
  To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Thu, 05 May 2022 11:44:41 -0700
>> 
>> > Why does it "mess things up", and what exactly is the nature of the
>> > mess-up?  A pure-ASCII string can be either unibyte or multibyte, and
>> > that shouldn't change a thing.
>> 
>> If the string is not ASCII, we need to encode it before sending to the
>> server, and tell the server what encoding we used. Microsoft Exchange
>> servers can't handle any encoding other than ascii.
>
> What do you mean by "ascii encoding" in this context?
>
> When you say that Microsoft Exchange can't handle any encoding other
> than ascii, does it mean it cannot handle _any_ non-ASCII addressee
> names?  That'd be hard to believe, because such addressee names are
> nowadays in wide use.  So I guess you mean something else, but what?

The IMAP search command can look like "UID SEARCH", or "UID SEARCH
CHARSET XXX". Specifying no charset is (I think) the same as specifying
US-ASCII, which is the only charset that Exchange accepts for the search
command.

If the search string is multibyte (in my mind this means "multiple bytes
per character", I guess that's where I went wrong), you have to encode
it as something, tell the server what charset you used to encode it,
then send both the encoded string and the number of bytes it represents.
The gnus-search code encodes it as emacs-utf-8, and then sends UID
SEARCH CHARSET UTF-8, which Exchange won't accept.
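The flow described above could be sketched roughly as follows. This is not the gnus-search.el implementation: the function name `my/imap-search-string' is hypothetical, the `utf-8' coding system is swapped in for the one the real code uses, and the "{N+}" literal prefix simply mirrors the sample output at the top of the thread.

```elisp
(defun my/imap-search-string (str)
  "Return STR ready to be sent as an IMAP search argument.
Pure-ASCII text is returned unchanged, so no CHARSET is needed.
Anything else is encoded as UTF-8 and prefixed with its byte
count in IMAP literal syntax."
  (if (not (string-match-p "[^[:ascii:]]" str))
      str
    (let ((encoded (encode-coding-string str 'utf-8)))
      ;; `string-bytes' of the encoded string is the number of
      ;; octets that will go over the wire.
      (concat (format "{%d+}\n" (string-bytes encoded)) encoded))))

(my/imap-search-string "FROM eric")   ; "FROM eric"
(my/imap-search-string "FROM 张三")    ; begins with "{11+}"
```

Because the branch is chosen by content rather than by `multibyte-string-p', an ASCII query that happens to be multibyte still takes the plain path and never triggers the CHARSET UTF-8 request.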

>> So if our code thinks a string isn't ascii, it sends the encoding
>> message to the IMAP server, and Exchange blows up.
>
> Encoding ascii yields a string that is identical to the original (IIUC
> what you mean by "encoding"), so I don't follow you here.
>
>> If the string is ascii, we don't try to encode it, and everything's
>> fine. So I need to know whether the string is actually ascii or not.
>
> You can do that using the regexp class [:ascii:], I guess.

That's how I'll solve it, then.





* Re: More confusion about multibyte vs unibyte strings
  2022-05-06  0:45       ` Eric Abrahamsen
@ 2022-05-06  2:58         ` Stefan Monnier via Users list for the GNU Emacs text editor
  2022-05-06 16:45           ` Eric Abrahamsen
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-05-06  2:58 UTC
  To: help-gnu-emacs

> If the search string is multibyte (in my mind this means "multiple bytes
> per character", I guess that's where I went wrong), you have to encode

In ELisp, "multibyte" means "a sequence of characters", whereas
"unibyte" means "a sequence of bytes".
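To make that distinction concrete, here is a small sketch that looks at what the elements of each kind of string are. The helper name `my/string-units' is hypothetical.

```elisp
(defun my/string-units (str)
  "Return (MULTIBYTE-P LENGTH FIRST-ELEMENT) for STR."
  (list (multibyte-string-p str) (length str) (aref str 0)))

;; A multibyte string: elements are characters (here U+5F20).
(my/string-units "张三")
;; => (t 2 #x5f20)

;; The same text encoded as UTF-8: elements are bytes (here #xe5).
(my/string-units (encode-coding-string "张三" 'utf-8))
;; => (nil 6 #xe5)
```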


        Stefan





* Re: More confusion about multibyte vs unibyte strings
  2022-05-06  2:58         ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2022-05-06 16:45           ` Eric Abrahamsen
  2022-05-06 17:39             ` Stefan Monnier via Users list for the GNU Emacs text editor
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Abrahamsen @ 2022-05-06 16:45 UTC
  To: help-gnu-emacs

Stefan Monnier via Users list for the GNU Emacs text editor
<help-gnu-emacs@gnu.org> writes:

>> If the search string is multibyte (in my mind this means "multiple bytes
>> per character", I guess that's where I went wrong), you have to encode
>
> In ELisp, "multibyte" means "a sequence of characters", whereas
> "unibyte" means "a sequence of bytes".

Okay, thanks. I'd thought that distinction was covered by "encoded" vs
"decoded" strings. Maybe the lesson will stick this time.





* Re: More confusion about multibyte vs unibyte strings
  2022-05-06 16:45           ` Eric Abrahamsen
@ 2022-05-06 17:39             ` Stefan Monnier via Users list for the GNU Emacs text editor
  2022-05-06 18:02               ` Eric Abrahamsen
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-05-06 17:39 UTC
  To: help-gnu-emacs

>>> If the search string is multibyte (in my mind this means "multiple bytes
>>> per character", I guess that's where I went wrong), you have to encode
>>
>> In ELisp, "multibyte" means "a sequence of characters", whereas
>> "unibyte" means "a sequence of bytes".
>
> Okay, thanks. I'd thought that distinction was covered by "encoded" vs
> "decoded" strings. Maybe the lesson will stick this time.

There's no reliable way to determine whether a string is decoded (other
than to trace its origin and figure out what the code intended it to
mean).

This said, multibyte/unibyte can be used as an approximation of
decoded/encoded (my own local hacks include signaling errors when
trying to decode a multibyte string or to encode a unibyte string, but
it trips over various places where we do that for legitimate
reasons :-( )
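A rough idea of what such a check might look like, written as wrappers rather than as changes to Emacs itself. This is only a guess at the shape of the approach, not the local hack mentioned above; both function names are hypothetical.

```elisp
(defun my/strict-decode (str coding-system)
  "Decode STR with CODING-SYSTEM, but only if STR is unibyte.
Signal an error for multibyte input, which probably means the
string was already decoded."
  (when (multibyte-string-p str)
    (error "Refusing to decode a multibyte string: %S" str))
  (decode-coding-string str coding-system))

(defun my/strict-encode (str coding-system)
  "Encode STR with CODING-SYSTEM, but only if STR is multibyte.
Signal an error for unibyte input, which probably means the
string was already encoded."
  (unless (multibyte-string-p str)
    (error "Refusing to encode a unibyte string: %S" str))
  (encode-coding-string str coding-system))
```

Note that `my/strict-encode' would also reject a unibyte pure-ASCII string, which is harmless to encode; that is one example of the legitimate cases such a check can trip over.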


        Stefan





* Re: More confusion about multibyte vs unibyte strings
  2022-05-06 17:39             ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2022-05-06 18:02               ` Eric Abrahamsen
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Abrahamsen @ 2022-05-06 18:02 UTC
  To: help-gnu-emacs

Stefan Monnier via Users list for the GNU Emacs text editor
<help-gnu-emacs@gnu.org> writes:

>>>> If the search string is multibyte (in my mind this means "multiple bytes
>>>> per character", I guess that's where I went wrong), you have to encode
>>>
>>> In ELisp, "multibyte" means "a sequence of characters", whereas
>>> "unibyte" means "a sequence of bytes".
>>
>> Okay, thanks. I'd thought that distinction was covered by "encoded" vs
>> "decoded" strings. Maybe the lesson will stick this time.
>
> There's no reliable way to determine whether a string is decoded (other
> than to trace its origin and figure out what the code intended it to
> mean).
>
> This said, multibyte/unibyte can be used as an approximation of
> decoded/encoded (my own local hacks include signaling errors when
> trying to decode a multibyte string or to encode a unibyte string, but
> it trips over various places where we do that for legitimate
> reasons :-( )

Thanks for this explanation! I'm grateful that my bit of code doesn't
actually need to be that complicated...




