From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jürgen Hartmann
Newsgroups: gmane.emacs.help
Subject: [Solved] RE: Differences between identical strings in Emacs lisp
Date: Tue, 7 Apr 2015 15:55:48 +0200
In-Reply-To: <87pp7gu7by.fsf@kuiper.lan.informatimago.com>
To: help-gnu-emacs@gnu.org

Thank you Pascal Bourguignon for your explanation:

> ...
>
> (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
> --> (nil t)
>
> string-equal (and therefore string=) don't ignore the multibyte property
> of a string.

So it's all about the multibyte property?

> You can use:
>
> (mapcar 'string-as-unibyte (list "\xBA" (concat '(#xBA))))
> --> ("\272" "\302\272")
>
> to see the difference.

I see: "\xBA" stays as it is, a unibyte string containing the raw character
\272, while the multibyte string (concat '(#xBA)) gets converted into its
UTF-8 unibyte form.

> Now, it's hard to say how to "solve" this problem, basically, you asked
> for it: "\xBA" is not a valid way to write a string containing masculine
> ordinal.

It seems that one can use "\u00BA" to achieve this in a string constant; it
evaluates to a multibyte string containing the integer 186:

"\u00BA"
--> "º"

(multibyte-string-p "\u00BA")
--> t

(append "\u00BA" ())
--> (186)

I found it very
surprising that it is not only the escape sequences (characters) in the
string constant that determine its multibyte property, but that it also works
the other way round: the same \x escape sequence yields different results
depending on the multibyte property of the string constant it is used in.
For example, the constant "\x3FFFBA" is a unibyte string containing the
integer 186:

"\x3FFFBA"
--> "\272"

(multibyte-string-p "\x3FFFBA")
--> nil

(append "\x3FFFBA" ())
--> (186)

The constant "\x3FFFBA Ä", on the other hand, is a multibyte string in which
the sequence \x3FFFBA yields the integer 4194234:

"\x3FFFBA Ä"
--> "\272 Ä"

(multibyte-string-p "\x3FFFBA Ä")
--> t

(append "\x3FFFBA Ä" ())
--> (4194234 32 196)

This seems to be an undocumented feature.

> I guess you could extract back the bytes, and recreate the string
> correctly:
>
> (map 'string 'identity (map 'list 'identity "\xBA"))
> --> "º"
>
> (string= (map 'string 'identity (map 'list 'identity "\xBA"))
>          (concat '(#xBA)))
> --> t

So reassembling the string by means of map 'string results in a string
containing the same integer as "\xBA", namely 186, but as a multibyte string
with the corresponding interpretation of its contents?

In this respect it is interesting to compare another pair of strings: "A" and
(substring "AÄ" 0 1).
Both of them contain the same integer, namely 65, and are
printed as "A". They differ only in their multibyte property: the former is
a unibyte string, the latter multibyte:

"A"
--> "A"

(multibyte-string-p "A")
--> nil

(append "A" ())
--> (65)

and

(substring "AÄ" 0 1)
--> "A"

(multibyte-string-p (substring "AÄ" 0 1))
--> t

(append (substring "AÄ" 0 1) ())
--> (65)

The point is that they compare equal in spite of their different multibyte
property:

(string= "A" (substring "AÄ" 0 1))
--> t

So, as you said before: "string-equal (and therefore string=) don't ignore
the multibyte property of a string". But it seems that it is not this
property per se that makes the difference, but the differing interpretation
of the string's contents as a result of this property.

> (On the other hand, one might argue that having both unibyte and
> multibyte strings in a lisp implementation is not a good idea, and
> there's the opportunity for a big refactoring and simplification).
>
> ...

At least it makes it hard to keep the concepts clear.

To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
above. They have the same integer content, differ only in their multibyte
property, and compare equal.

If we just change their integer values, in both strings alike, from 65 to
186, we get the pair "\xBA" and (concat '(#xBA)) that we also discussed
before. Here, too, the only difference lies in the multibyte property, while
the integer values are the same. But this time the strings compare unequal.

One might say that this is not surprising, because this time the integers are
interpreted as different characters.
But this would contradict
the definition of the term character, according to which a character actually
_is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

Do we reach the limit of the definition of what a character is?

But this gets pretty philosophical. For practical purposes you helped me
a lot, and I think I now have a better feeling for this topic.

Thank you very much.

Jürgen