From: Eli Zaretskii
Newsgroups: gmane.emacs.help
Subject: Re: [Solved] RE: Differences between identical strings in Emacs lisp
Date: Tue, 07 Apr 2015 17:22:22 +0300
Message-ID: <83mw2khvc1.fsf@gnu.org>
References: <87pp7gu7by.fsf@kuiper.lan.informatimago.com>
To: help-gnu-emacs@gnu.org

> From: Jürgen Hartmann
> Date: Tue, 7 Apr 2015 15:55:48 +0200
>
> Thank you Pascal Bourguignon for your explanation:
>
> > ...
> >
> > (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
> > --> (nil t)
> >
> > string-equal (and therefore string=) don't ignore the multibyte property
> > of a string.
>
> So it's all about the multibyte property?

It's about the tricky relationships between unibyte and multibyte
strings.

May I ask why you need to mess with unibyte strings?  (Your original
message doesn't seem to present a real problem, just something that
puzzled you.)

> > Now, it's hard to say how to "solve" this problem, basically, you asked
> > for it: "\xBA" is not a valid way to write a string containing masculine
> > ordinal.
>
> In seams that one can use "\u00BA" to achieve this in a string constant; it
> evaluates to a multibyte string containing the integer 186:
>
>    "\u00BA"
>    --> "º"

Why can't you simply use the º character? why do you need to use its
codepoint?

>    (multibyte-string-p "\u00BA")
>    --> t
>
>    (append "\u00BA" ())
>    --> (186)
>
> I found it very surprising, that it is not only the escape sequences
> (characters) in the string constant that determine its multibyte property,
> but it is also the other way round: The sequence \x yields
> different results depending on the multibyte property of the string constant
> it is used in. For example the constant "\x3FFFBA" is an unibyte string
> containing the integer 186:
>
>    "\x3FFFBA"
>    --> "\272"

"Contains" is incorrect here.  That constant _represents_ a raw byte
whose value is 186.  Emacs goes out of its way under the hood to show
you 186 when the buffer or string contains 0x3FFFBA.

>
>    (multibyte-string-p "\x3FFFBA")
>    --> nil
>
>    (append "\x3FFFBA" ())
>    --> (186)
>
> The constant "\x3FFFBA Ä" on the other hand is a mulibyte string in which the
> sequence \x3FFFBA yields the integer 4194234:
>
>    "\x3FFFBA Ä"
>    --> "\272 Ä"
>
>    (multibyte-string-p "\x3FFFBA Ä")
>    --> t
>
>    (append "\x3FFFBA Ä" ())
>    --> (4194234 32 196)
>
> This seems to be an undocumented feature.

It's barely documented in the node "Text Representations" in the ELisp
manual.  This is a tricky issue, so you are well advised to stay away
from unibyte strings as much as you can, for your sanity's sake.

> In this respect it is interesting to compare another pair of strings: "A" and
> (substring "AÄ" 0 1). Both of them contain the same integer, namely 65, and are
> printed as "A"--they only differ in their multibyte property: The former is
> an unibyte string, the latter multibyte:

Don't try to learn about unibyte/multibyte strings using ASCII
characters as examples, because ASCII is treated specially for obvious
reasons.

> > (On the other hand, one might argue that having both unibyte and
> > multibyte strings in a lisp implementation is not a good idea, and
> > there's the opportunity for a big refactoring and simplification).

Hear, hear!

> To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
> above. They have the same integer content, only differ in their multibyte
> property and compare equal.

Yes, and therefore you don't need to consider the multibyte property.

> If we just change their integer values--in both strings alike--from 65 to
> 186, we get the pair "\xBA" and (concat '(#xBA)), that we also discussed
> before. Also here the only difference lies in the multibyte property, while
> the integer values are the same. But this time the strings compare different.

As they should: you are comparing a character with a raw byte.

> One might say that this is not surprising, because this time the integers are
> interpreted as different characters. But this would be in contradiction to
> the definition of the term character according to which a character actually
> _is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

It is an integer, but note that no one told you anywhere that a raw
byte is a character.  It's a raw byte.

> Does we come to the limit of the definition of what a character is?
>
> But this gets pretty philosophical. For the practical purpose you helped me
> a lot and I think that I got some better feeling for this topic.

I'd still suggest that you try as much as you can not to use unibyte
strings in your Lisp applications.  That way lies madness.
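
For reference, here is a minimal sketch of the character-versus-raw-byte
distinction discussed in this thread.  It only restates results already
quoted above, plus one call to `string-to-multibyte', which converts a
unibyte string to multibyte and so should expose the raw byte as the
integer 4194234 (#x3FFFBA).  The helper name `demo-string-info' is made
up for illustration and is not part of Emacs.

```elisp
;; `demo-string-info' is a hypothetical helper, not an Emacs function.
(defun demo-string-info (s)
  "Return (MULTIBYTE-P . ELEMENTS) for string S."
  (cons (multibyte-string-p s) (append s nil)))

;; Unibyte string holding one raw byte whose value is 186.
(demo-string-info "\xBA")                        ; => (nil 186)

;; Multibyte string holding the character U+00BA (masculine ordinal).
(demo-string-info "\u00BA")                      ; => (t 186)

;; The same raw byte, converted to its multibyte representation.
(demo-string-info (string-to-multibyte "\xBA"))  ; => (t 4194234)

;; A raw byte is not a character, so these compare different.
(string= "\xBA" "\u00BA")                        ; => nil
(string= (string-to-multibyte "\xBA") "\u00BA")  ; => nil
```

The first two strings both report 186 as their only element, but only
the second one holds a character.  Once the raw byte is converted to
multibyte, the difference is visible in the integer itself.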