unofficial mirror of help-gnu-emacs@gnu.org
From: Eli Zaretskii <eliz@gnu.org>
To: help-gnu-emacs@gnu.org
Subject: Re: [Solved] RE: Differences between identical strings in Emacs lisp
Date: Tue, 07 Apr 2015 17:22:22 +0300
Message-ID: <83mw2khvc1.fsf@gnu.org>
In-Reply-To: <DUB124-W47A919DD324708DB061CCA8FD0@phx.gbl>

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Tue, 7 Apr 2015 15:55:48 +0200
> 
> Thank you Pascal Bourguignon for your explanation:
> 
> > ...
> > 
> >     (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
> >     --> (nil t)
> > 
> > string-equal (and therefore string=) don't ignore the multibyte property
> > of a string.
> 
> So it's all about the multibyte property?

It's about the tricky relationships between unibyte and multibyte
strings.
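
The difference can be seen side by side. This is only a minimal sketch;
the function name is made up for illustration, and the results in the
comments are the ones reported earlier in this thread:

```elisp
;; Hypothetical helper, named only for this illustration.
(defun my/compare-byte-and-char ()
  "Compare a raw byte with the character that has the same integer value."
  (let ((raw "\xBA")              ; unibyte string: the raw byte 186
        (chr (concat '(#xBA))))   ; multibyte string: the character U+00BA
    (list (multibyte-string-p raw)          ; nil
          (multibyte-string-p chr)          ; t
          (= (aref raw 0) (aref chr 0))     ; t, both read back as 186
          (string= raw chr))))              ; nil, byte vs. character
```

The elements read back as the same integer, yet the strings do not
compare equal, because one holds a raw byte and the other a character.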

May I ask why you need to mess with unibyte strings?  (Your original
message doesn't seem to present a real problem, just something that
puzzled you.)

> > Now, it's hard to say how to "solve" this problem, basically, you asked
> > for it: "\xBA" is not a valid way to write a string containing masculine
> > ordinal.
> 
> It seems that one can use "\u00BA" to achieve this in a string constant; it
> evaluates to a multibyte string containing the integer 186:
> 
>    "\u00BA"
>    --> "º"

Why can't you simply use the º character?  Why do you need to use its
codepoint?

>    (multibyte-string-p "\u00BA")
>    --> t
> 
>    (append "\u00BA" ())
>    --> (186)
> 
> I found it very surprising, that it is not only the escape sequences
> (characters) in the string constant that determine its multibyte property,
> but it is also the other way round: The sequence \x yields
> different results depending on the multibyte property of the string constant
> it is used in. For example the constant "\x3FFFBA" is a unibyte string
> containing the integer 186:
> 
>    "\x3FFFBA"
>    --> "\272"

"Contains" is incorrect here.  That constant _represents_ a raw byte
whose value is 186.  Emacs goes out of its way under the hood to show
you 186 when the buffer or string contains 0x3FFFBA.
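
The correspondence between the raw byte and its internal code can be
inspected with the conversion functions. A minimal sketch, assuming a
reasonably recent Emacs (23 or later); the function name is
hypothetical:

```elisp
;; Hypothetical helper, named only for this illustration.
(defun my/raw-byte-round-trip ()
  "Map the raw byte 186 to its internal code and back."
  (list
   ;; Byte 186 -> internal raw-byte code #x3FFFBA (4194234).
   (unibyte-char-to-multibyte #xBA)
   ;; Internal code #x3FFFBA -> byte 186.
   (multibyte-char-to-unibyte #x3FFFBA)
   ;; The same mapping applied to a whole unibyte string.
   (aref (string-to-multibyte "\xBA") 0)))
```

Evaluating `(my/raw-byte-round-trip)` should give `(4194234 186 4194234)`,
which matches the 4194234 you saw in the multibyte constant below.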

> 
>    (multibyte-string-p "\x3FFFBA")
>    --> nil
> 
>    (append "\x3FFFBA" ())
>    --> (186)
> 
> The constant "\x3FFFBA Ä" on the other hand is a multibyte string in which the
> sequence \x3FFFBA yields the integer 4194234:
> 
>    "\x3FFFBA Ä"
>    --> "\272 Ä"
> 
>    (multibyte-string-p "\x3FFFBA Ä")
>    --> t
> 
>    (append "\x3FFFBA Ä" ())
>    --> (4194234 32 196)
> 
> This seems to be an undocumented feature.

It's barely documented in the node "Text Representations" in the ELisp
manual.

This is a tricky issue, so you are well advised to stay away from
unibyte strings as much as you can, for your sanity's sake.

> In this respect it is interesting to compare another pair of strings: "A" and
> (substring "AÄ" 0 1). Both of them contain the same integer, namely 65, and are
> printed as "A"--they only differ in their multibyte property: The former is
> a unibyte string, the latter multibyte:

Don't try to learn about unibyte/multibyte strings using ASCII
characters as examples, because ASCII is treated specially for obvious
reasons.

> > (On the other hand, one might argue that having both unibyte and
> > multibyte strings in a lisp implementation is not a good idea, and
> > there's the opportunity for a big refactoring and simplification).

Hear, hear!

> To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
> above. They have the same integer content, only differ in their multibyte
> property and compare equal.

Yes, and therefore you don't need to consider the multibyte property.
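
For the record, the ASCII case looks like this. It is only a sketch;
the function name is made up, and "\u00C4" is used instead of a literal
Ä so that the snippet does not depend on the file's encoding:

```elisp
;; Hypothetical helper, named only for this illustration.
(defun my/compare-ascii-strings ()
  "Compare a unibyte and a multibyte string that both hold only \"A\"."
  (let ((uni "A")                            ; unibyte: pure ASCII constant
        (multi (substring "A\u00C4" 0 1)))   ; multibyte: cut from "AÄ"
    (list (multibyte-string-p uni)           ; nil
          (multibyte-string-p multi)         ; t
          (string= uni multi))))             ; t, ASCII compares equal
```

The multibyte property differs, but the comparison succeeds, because an
ASCII byte and the corresponding ASCII character are the same thing.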

> If we just change their integer values--in both strings alike--from 65 to
> 186, we get the pair "\xBA" and (concat '(#xBA)), that we also discussed
> before. Also here the only difference lies in the multibyte property, while
> the integer values are the same. But this time the strings compare different.

As they should: you are comparing a character with a raw byte.

> One might say that this is not surprising, because this time the integers are
> interpreted as different characters. But this would be in contradiction to
> the definition of the term character according to which a character actually
> _is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

It is an integer, but note that no one told you anywhere that a raw
byte is a character.  It's a raw byte.
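
A raw byte becomes a character only when it is decoded with some
coding system. A minimal sketch, assuming the byte was meant as Latin-1
(that choice is an assumption, nothing in this thread says where the
byte came from); the function name is hypothetical:

```elisp
;; Hypothetical helper, named only for this illustration.
(defun my/decode-raw-byte ()
  "Decode the raw byte 186 as Latin-1 and compare it with U+00BA."
  (let* ((raw "\xBA")                               ; unibyte: raw byte 186
         (chr (decode-coding-string raw 'latin-1))) ; now a real character
    (list (multibyte-string-p chr)                  ; t
          (string= chr "\u00BA"))))                 ; t
```

After decoding, the string is multibyte and compares equal to the
masculine ordinal indicator.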

> Do we come to the limit of the definition of what a character is?
> 
> But this gets pretty philosophical. For the practical purpose you helped me
> a lot and I think that I got some better feeling for this topic.

I'd still suggest that you try as much as you can not to use unibyte
strings in your Lisp applications.  That way lies madness.



