* More confusion about multibyte vs unibyte strings
From: Eric Abrahamsen @ 2022-05-05 16:58 UTC
To: help-gnu-emacs

In gnus-search.el, we do some work on search strings before sending them
to an IMAP server as a query: there are particular formats that need to
be used depending on whether the string is plain ASCII, or needs to be
encoded as UTF-8 or something. From the code itself:

  (gnus-search-imap-handle-string
   (make-instance 'gnus-search-imap :literal-plus t)
   "FROM eric")
  -> "FROM eric"

  (gnus-search-imap-handle-string
   (make-instance 'gnus-search-imap :literal-plus t)
   "FROM 张三")
  -> "{11+} FROM \345\274\240\344\270\211"

The function above uses `multibyte-string-p' to test whether the string
needs the extra handling. This works correctly in the minibuffer and
*scratch*:

  (multibyte-string-p "FROM eric") -> nil

  (multibyte-string-p "FROM 张三") -> t

but when I edebug the code during an actual IMAP search, the test
returns t for both strings, which messes things up.

I must be using it wrong! But I don't understand why. What can change in
the evaluation environment such that the calls to `multibyte-string-p'
would return different results at different times?

And what check *should* I be using to see if a string is pure ASCII?

Thanks,
Eric
* Re: More confusion about multibyte vs unibyte strings
From: Eli Zaretskii @ 2022-05-05 17:34 UTC
To: help-gnu-emacs

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Thu, 05 May 2022 09:58:43 -0700
>
> The function above uses `multibyte-string-p' to test whether the string
> needs the extra handling. This works correctly in the minibuffer and
> *scratch*:
>
> (multibyte-string-p "FROM eric") -> nil
>
> (multibyte-string-p "FROM 张三") -> t
>
> but when I edebug the code during an actual IMAP search, the test
> returns t for both strings, which messes things up.

Why does it "mess things up", and what exactly is the nature of the
mess-up? A pure-ASCII string can be either unibyte or multibyte, and
that shouldn't change a thing.

> I must be using it wrong! But I don't understand why. What can change in
> the evaluation environment such that the calls to `multibyte-string-p'
> would return different results at different times?

Any number of string operations can convert a pure-ASCII string into a
multibyte string. The most frequent one is decode-coding-string.

Again, why should this be a problem for your code?

> And what check *should* I be using to see if a string is pure ASCII?

Why do you care?
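Eli's point that a pure-ASCII string can be either unibyte or multibyte can be checked directly. The following is a minimal sketch, not code from the thread: it uses the built-in `string-to-multibyte' as a stand-in for whatever operation changes the representation in the real IMAP code path, and the function name `my-representation-demo' is made up for illustration.

```elisp
;; Hypothetical demo function; the name is not from gnus-search.el.
(defun my-representation-demo ()
  "Show one ASCII text held in two different string representations."
  (let* ((uni "FROM eric")                  ; ASCII-only literal, read as unibyte
         (multi (string-to-multibyte uni))) ; same characters, multibyte representation
    (list (multibyte-string-p uni)          ; nil
          (multibyte-string-p multi)        ; t
          (string= uni multi))))            ; t, the text itself is unchanged

(my-representation-demo) ; => (nil t t)
```

The predicate reports how the string is stored, not what it contains, which is why it may answer differently for the same text at different points in a program.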
* Re: More confusion about multibyte vs unibyte strings
From: Eric Abrahamsen @ 2022-05-05 18:44 UTC
To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Thu, 05 May 2022 09:58:43 -0700
>>
>> The function above uses `multibyte-string-p' to test whether the string
>> needs the extra handling. This works correctly in the minibuffer and
>> *scratch*:
>>
>> (multibyte-string-p "FROM eric") -> nil
>>
>> (multibyte-string-p "FROM 张三") -> t
>>
>> but when I edebug the code during an actual IMAP search, the test
>> returns t for both strings, which messes things up.
>
> Why does it "mess things up", and what exactly is the nature of the
> mess-up? A pure-ASCII string can be either unibyte or multibyte, and
> that shouldn't change a thing.

If the string is not ASCII, we need to encode it before sending to the
server, and tell the server what encoding we used. Microsoft Exchange
servers can't handle any encoding other than ascii. So if our code
thinks a string isn't ascii, it sends the encoding message to the IMAP
server, and Exchange blows up.

If the string is ascii, we don't try to encode it, and everything's
fine. So I need to know whether the string is actually ascii or not.

I can solve this some other way, like

  (equal (length str) (string-bytes str))

but I'm just trying to figure out why this doesn't behave the way I
expect it to. I'd thought that `multibyte-string-p' essentially
performed the above length test.
* Re: More confusion about multibyte vs unibyte strings
From: Eli Zaretskii @ 2022-05-05 19:23 UTC
To: help-gnu-emacs

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Thu, 05 May 2022 11:44:41 -0700
>
> > Why does it "mess things up", and what exactly is the nature of the
> > mess-up? A pure-ASCII string can be either unibyte or multibyte, and
> > that shouldn't change a thing.
>
> If the string is not ASCII, we need to encode it before sending to the
> server, and tell the server what encoding we used. Microsoft Exchange
> servers can't handle any encoding other than ascii.

What do you mean by "ascii encoding" in this context?

When you say that Microsoft Exchange can't handle any encoding other
than ascii, does it mean it cannot handle _any_ non-ASCII addressee
names? That'd be hard to believe, because such addressee names are
nowadays in wide use. So I guess you mean something else, but what?

> So if our code thinks a string isn't ascii, it sends the encoding
> message to the IMAP server, and Exchange blows up.

Encoding ascii yields a string that is identical to the original (IIUC
what you mean by "encoding"), so I don't follow you here.

> If the string is ascii, we don't try to encode it, and everything's
> fine. So I need to know whether the string is actually ascii or not.

You can do that using the regexp class [:ascii:], I guess.

> I can solve this some other way, like
> (equal (length str) (string-bytes str))

That should return non-nil for unibyte string that includes bytes
above 127 as well, no?

> but I'm just trying to figure out why this doesn't behave the way I
> expect it to. I'd thought that `multibyte-string-p' essentially
> performed the above length test.

No, it doesn't. A pure ASCII string can be made multibyte without
changing its payload, and Emacs usually makes unibyte strings out of
pure ASCII characters.
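The two checks discussed in this message can be compared side by side. This is a sketch under stated assumptions: `my-ascii-only-p' and `my-length-equals-bytes-p' are hypothetical helper names, and the regexp is one plausible way to use the [:ascii:] class that Eli suggests, not necessarily the form used in gnus-search.el.

```elisp
;; Hypothetical helpers for illustration; neither is part of Emacs or Gnus.
(defun my-ascii-only-p (str)
  "Return t if STR contains no character outside the ASCII range."
  (not (string-match-p "[^[:ascii:]]" str)))

(defun my-length-equals-bytes-p (str)
  "The length test proposed earlier in the thread, for comparison."
  (= (length str) (string-bytes str)))

(my-ascii-only-p "FROM eric")                       ; t
(my-ascii-only-p (string-to-multibyte "FROM eric")) ; t, representation is irrelevant
(my-ascii-only-p "FROM \u5f20\u4e09")               ; nil ("\u5f20\u4e09" is 张三)

;; Three raw bytes above 127 in a unibyte string: one byte per element,
;; so the length test passes even though nothing here is ASCII.
(my-length-equals-bytes-p (unibyte-string #xe5 #xbc #xa0)) ; t
```

The last form is the case Eli raises: in a unibyte string the length and the byte count always agree, so that test cannot tell ASCII from raw high bytes.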
* Re: More confusion about multibyte vs unibyte strings
From: Eric Abrahamsen @ 2022-05-06 0:45 UTC
To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Thu, 05 May 2022 11:44:41 -0700
>>
>> > Why does it "mess things up", and what exactly is the nature of the
>> > mess-up? A pure-ASCII string can be either unibyte or multibyte, and
>> > that shouldn't change a thing.
>>
>> If the string is not ASCII, we need to encode it before sending to the
>> server, and tell the server what encoding we used. Microsoft Exchange
>> servers can't handle any encoding other than ascii.
>
> What do you mean by "ascii encoding" in this context?
>
> When you say that Microsoft Exchange can't handle any encoding other
> than ascii, does it mean it cannot handle _any_ non-ASCII addressee
> names? That'd be hard to believe, because such addressee names are
> nowadays in wide use. So I guess you mean something else, but what?

The IMAP search command can look like "UID SEARCH", or "UID SEARCH
CHARSET XXX". Specifying no charset is (I think) the same as specifying
US-ASCII, which is the only charset that Exchange accepts for the search
command.

If the search string is multibyte (in my mind this means "multiple bytes
per character", I guess that's where I went wrong), you have to encode
it as something, tell the server what charset you used to encode it,
then send both the encoded string and the number of bytes it represents.
The gnus-search code encodes it as emacs-utf-8, and then sends UID
SEARCH CHARSET UTF-8, which Exchange won't accept.

>> So if our code thinks a string isn't ascii, it sends the encoding
>> message to the IMAP server, and Exchange blows up.
>
> Encoding ascii yields a string that is identical to the original (IIUC
> what you mean by "encoding"), so I don't follow you here.
>
>> If the string is ascii, we don't try to encode it, and everything's
>> fine. So I need to know whether the string is actually ascii or not.
>
> You can do that using the regexp class [:ascii:], I guess.

That's how I'll solve it, then.
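The encode-and-count step Eric describes could look roughly like the sketch below. It is not the gnus-search.el implementation: `my-imap-search-string' is a hypothetical name, plain `utf-8' is used as the coding system, and the newline after the byte count is an assumption of this sketch. A real caller would also have to add CHARSET UTF-8 to the SEARCH command, which is the part Exchange reportedly rejects.

```elisp
;; Hypothetical sketch; not the actual gnus-search.el code.
(defun my-imap-search-string (str)
  "Return STR in a form usable as an IMAP SEARCH argument.
ASCII-only strings are returned unchanged.  Anything else is encoded
as UTF-8 and prefixed with the number of encoded bytes."
  (if (string-match-p "[^[:ascii:]]" str)
      (let ((encoded (encode-coding-string str 'utf-8)))
        ;; The count is taken from the encoded bytes, not the characters.
        ;; The newline after the count is an assumption of this sketch.
        (concat (format "{%d+}\n" (string-bytes encoded)) encoded))
    str))

(my-imap-search-string "FROM eric")         ; => "FROM eric"
(my-imap-search-string "FROM \u5f20\u4e09") ; starts with "{11+}", as in the first message
```

Testing the content with [:ascii:] rather than the representation means an ASCII query goes out unencoded whether it arrives as a unibyte or a multibyte string.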
* Re: More confusion about multibyte vs unibyte strings
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-05-06 2:58 UTC
To: help-gnu-emacs

> If the search string is multibyte (in my mind this means "multiple bytes
> per character", I guess that's where I went wrong), you have to encode

In ELisp, "multibyte" means "a sequence of characters", whereas
"unibyte" means "a sequence of bytes".

Stefan
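Stefan's distinction between characters and bytes shows up when a string is encoded and decoded again. A minimal sketch follows; the function name `my-encode-decode-demo' is made up for illustration, and "\u5f20\u4e09" is the 张三 from the first message written with escapes.

```elisp
;; Hypothetical demo function, for illustration only.
(defun my-encode-decode-demo ()
  "Round-trip two characters through UTF-8 and report what each step holds."
  (let* ((chars "\u5f20\u4e09")                       ; two characters
         (bytes (encode-coding-string chars 'utf-8))  ; their six UTF-8 bytes
         (back  (decode-coding-string bytes 'utf-8))) ; characters again
    (list (multibyte-string-p chars) (length chars)   ; t, 2
          (multibyte-string-p bytes) (length bytes)   ; nil, 6
          (string= chars back))))                     ; t

(my-encode-decode-demo) ; => (t 2 nil 6 t)
```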
* Re: More confusion about multibyte vs unibyte strings
From: Eric Abrahamsen @ 2022-05-06 16:45 UTC
To: help-gnu-emacs

Stefan Monnier via Users list for the GNU Emacs text editor
<help-gnu-emacs@gnu.org> writes:

>> If the search string is multibyte (in my mind this means "multiple bytes
>> per character", I guess that's where I went wrong), you have to encode
>
> In ELisp, "multibyte" means "a sequence of characters", whereas
> "unibyte" means "a sequence of bytes".

Okay, thanks. I'd thought that distinction was covered by "encoded" vs
"decoded" strings. Maybe the lesson will stick this time.
* Re: More confusion about multibyte vs unibyte strings
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-05-06 17:39 UTC
To: help-gnu-emacs

>>> If the search string is multibyte (in my mind this means "multiple bytes
>>> per character", I guess that's where I went wrong), you have to encode
>>
>> In ELisp, "multibyte" means "a sequence of characters", whereas
>> "unibyte" means "a sequence of bytes".
>
> Okay, thanks. I'd thought that distinction was covered by "encoded" vs
> "decoded" strings. Maybe the lesson will stick this time.

There's no reliable way to determine whether a string is decoded (other
than to trace its origin and figure out what the code intended it to
mean).

This said, multibyte/unibyte can be used as an approximation of
decoded/encoded (my own local hacks include signaling errors when
trying to decode a multibyte string or to encode a unibyte string, but
it trips over various places where we do that for legitimate
reasons :-( )

Stefan
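Stefan does not show his local hacks, so the following is only a guess at what such a check might look like. Both wrappers, `my-strict-encode' and `my-strict-decode', are hypothetical names around the real `encode-coding-string' and `decode-coding-string'.

```elisp
;; Hypothetical wrappers; a guess at the kind of check described above,
;; not Stefan's actual code.
(defun my-strict-encode (string coding-system)
  "Encode STRING with CODING-SYSTEM, refusing input that is unibyte."
  (unless (multibyte-string-p string)
    (error "Refusing to encode a unibyte string: %S" string))
  (encode-coding-string string coding-system))

(defun my-strict-decode (string coding-system)
  "Decode STRING with CODING-SYSTEM, refusing input that is multibyte."
  (when (multibyte-string-p string)
    (error "Refusing to decode a multibyte string: %S" string))
  (decode-coding-string string coding-system))

(my-strict-encode "\u5f20\u4e09" 'utf-8) ; returns six bytes
;; (my-strict-encode "FROM eric" 'utf-8) ; signals an error: the literal is unibyte
```

The commented-out call is one example of the approximation failing: an ASCII-only literal is unibyte even though encoding it would be perfectly reasonable.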
* Re: More confusion about multibyte vs unibyte strings
From: Eric Abrahamsen @ 2022-05-06 18:02 UTC
To: help-gnu-emacs

Stefan Monnier via Users list for the GNU Emacs text editor
<help-gnu-emacs@gnu.org> writes:

>>>> If the search string is multibyte (in my mind this means "multiple bytes
>>>> per character", I guess that's where I went wrong), you have to encode
>>>
>>> In ELisp, "multibyte" means "a sequence of characters", whereas
>>> "unibyte" means "a sequence of bytes".
>>
>> Okay, thanks. I'd thought that distinction was covered by "encoded" vs
>> "decoded" strings. Maybe the lesson will stick this time.
>
> There's no reliable way to determine whether a string is decoded (other
> than to trace its origin and figure out what the code intended it to
> mean).
>
> This said, multibyte/unibyte can be used as an approximation of
> decoded/encoded (my own local hacks include signaling errors when
> trying to decode a multibyte string or to encode a unibyte string, but
> it trips over various places where we do that for legitimate
> reasons :-( )

Thanks for this explanation! I'm grateful that my bit of code doesn't
actually need to be that complicated...