Differences between identical strings in Emacs lisp

unofficial mirror of help-gnu-emacs@gnu.org

* Differences between identical strings in Emacs lisp
@ 2015-04-06 13:21 Jürgen Hartmann
  2015-04-07 21:10 ` Stefan Monnier
  0 siblings, 1 reply; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-06 13:21 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

What is the difference between the string represented by the constant "\xBA"
and the result of (concat '(#xBA))?

Background:

When I start Emacs 24.4 in Linux with the -Q option and the POSIX locale to
have clean conditions, i.e.

   LC_ALL=C emacs -Q

the evaluation of

   "\xBA"

in *scratch* (lisp interaction mode) yields a result
printed as "\272".

In contrast to that, the result of

   (concat '(#xBA))

is printed as "º", i.e. the "masculine ordinal indicator" glyph in double
quotes. The glyph's character is described by the command describe-char as
follows:

-----------------------------------------------------------------------------
             position: 235 of 341 (69%), column: 1
            character: º (displayed as º) (codepoint 186, #o272, #xba)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xBA
               script: latin
               syntax: _     which means: symbol
             category: .:Base, L:Left-to-right (strong), h:Korean, j:Japanese, l:Latin
             to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
          buffer code: #xC2 #xBA
            file code: #xC2 #xBA (encoded by coding system nil)
              display: by this font (glyph code)
    xft:-unknown-DejaVu Sans Mono-normal-normal-normal-*-15-*-*-*-m-0-iso10646-1 (#x7C)

Character code properties: customize what to show
  name: MASCULINE ORDINAL INDICATOR
  general-category: Lo (Letter, Other)
  decomposition: (super 111) (super 'o')

There are text properties here:
  face                 font-lock-string-face
  fontified            t

[back]
-----------------------------------------------------------------------------

Obviously the result of (concat '(#xBA)) gets interpreted (decoded) on the
basis of the unicode charset, while "\xBA" is treated as a raw byte.

Comparing these strings directly also shows that they are different:

   (string= "\xBA" (concat '(#xBA)))

evaluates to nil.

On the other hand, the expressions

   (append "\xBA" ())

and

   (append (concat '(#xBA)) ())

both evaluate to (186), indicating that the strings contain the same
character(s). So they are identical.

How can this contradiction be resolved?

Since I could not find a clue in the manuals or via google, any explanation,
idea, hint, link is greatly appreciated.

Juergen


* Re: Differences between identical strings in Emacs lisp
       [not found] <mailman.76.1428326518.904.help-gnu-emacs@gnu.org>
@ 2015-04-07  0:10 ` Pascal J. Bourguignon
  2015-04-07 13:55   ` [Solved] " Jürgen Hartmann
  0 siblings, 1 reply; 21+ messages in thread
From: Pascal J. Bourguignon @ 2015-04-07  0:10 UTC (permalink / raw)
  To: help-gnu-emacs

Jürgen Hartmann <juergen_hartmann_@hotmail.com> writes:

> What is the difference between the string represented by the constant "\xBA"
> and the result of (concat '(#xBA))?

    (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
    --> (nil t)

string-equal (and therefore string=) don't ignore the multibyte property
of a string.

You can use:

    (mapcar 'string-as-unibyte  (list "\xBA" (concat '(#xBA))))
    --> ("\272" "\302\272")

to see the difference.
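
The same difference can be checked without converting anything. What
follows is only a minimal sketch: the helper name demo-string-info is
hypothetical and not part of Emacs, while multibyte-string-p, length and
string-bytes are built-in. It reports the multibyte flag, the length in
characters and the length in bytes of a string:

```elisp
;; Hypothetical helper (not an Emacs function): summarize how a string
;; is stored.
(defun demo-string-info (s)
  "Return (MULTIBYTE-P CHARS BYTES) for the string S."
  (list (multibyte-string-p s) (length s) (string-bytes s)))

(demo-string-info "\xBA")    ; => (nil 1 1)  one raw byte
(demo-string-info "\u00BA")  ; => (t 1 2)    one character, two bytes inside
(string= "\xBA" "\u00BA")    ; => nil
```

Both strings have length 1, but their byte contents differ, which is
why the comparison fails.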

Now, it's hard to say how to "solve" this problem; basically, you asked
for it: "\xBA" is not a valid way to write a string containing the masculine
ordinal indicator.

I guess you could extract back the bytes, and recreate the string
correctly:

    (map 'string 'identity (map 'list 'identity "\xBA"))
    --> "º"

    (string= (map 'string 'identity (map 'list 'identity "\xBA"))
             (concat '(#xBA)))
    --> t

(On the other hand, one might argue that having both unibyte and
multibyte strings in a lisp implementation is not a good idea, and
there's the opportunity for a big refactoring and simplification).

-- 
__Pascal Bourguignon__                 http://www.informatimago.com/
“The factory of the future will have only two employees, a man and a
dog. The man will be there to feed the dog. The dog will be there to
keep the man from touching the equipment.” -- Carl Bass CEO Autodesk


* [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-07  0:10 ` Differences between identical strings in Emacs lisp Pascal J. Bourguignon
@ 2015-04-07 13:55   ` Jürgen Hartmann
  2015-04-07 14:22     ` Eli Zaretskii
  0 siblings, 1 reply; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-07 13:55 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Thank you Pascal Bourguignon for your explanation:

> ...
> 
>     (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
>     --> (nil t)
> 
> string-equal (and therefore string=) don't ignore the multibyte property
> of a string.

So it's all about the multibyte property?

> You can use:
> 
>     (mapcar 'string-as-unibyte  (list "\xBA" (concat '(#xBA))))
>     --> ("\272" "\302\272")
> 
> to see the difference.

I see: "\xBA" stays as it is--a unibyte string containing the raw character
\272--while the multibyte string (concat '(#xBA)) gets converted into its
UTF-8 unibyte form.

> Now, it's hard to say how to "solve" this problem, basically, you asked
> for it: "\xBA" is not a valid way to write a string containing masculine
> ordinal.

It seems that one can use "\u00BA" to achieve this in a string constant; it
evaluates to a multibyte string containing the integer 186:

   "\u00BA"
   --> "º"

   (multibyte-string-p "\u00BA")
   --> t

   (append "\u00BA" ())
   --> (186)

I found it very surprising, that it is not only the escape sequences
(characters) in the string constant that determine its multibyte property,
but it is also the other way round: The sequence \x yields
different results depending on the multibyte property of the string constant
it is used in. For example the constant "\x3FFFBA" is a unibyte string
containing the integer 186:

   "\x3FFFBA"
   --> "\272"

   (multibyte-string-p "\x3FFFBA")
   --> nil

   (append "\x3FFFBA" ())
   --> (186)

The constant "\x3FFFBA Ä" on the other hand is a multibyte string in which the
sequence \x3FFFBA yields the integer 4194234:

   "\x3FFFBA Ä"
   --> "\272 Ä"

   (multibyte-string-p "\x3FFFBA Ä")
   --> t

   (append "\x3FFFBA Ä" ())
   --> (4194234 32 196)

This seems to be an undocumented feature.

> I guess you could extract back the bytes, and recreate the string
> correctly:
> 
>     (map 'string 'identity (map 'list 'identity "\xBA"))
>     --> "º"
> 
>     (string= (map 'string 'identity (map 'list 'identity "\xBA"))
>              (concat '(#xBA)))
>     --> t

So reassembling the string by means of map 'string results in a string
containing the same integer as "\xBA", namely 186, but as a multibyte string
and the corresponding interpretation of its contents?

In this respect it is interesting to compare another pair of strings: "A" and
(substring "AÄ" 0 1). Both of them contain the same integer, namely 65, and are
printed as "A"--they only differ in their multibyte property: The former is
a unibyte string, the latter multibyte:

   "A"
   --> "A"

   (multibyte-string-p "A")
   --> nil

   (append "A" ())
   --> (65)

and

   (substring "AÄ" 0 1)
   --> "A"

   (multibyte-string-p (substring "AÄ" 0 1))
   --> t

   (append (substring "AÄ" 0 1) ())
   --> (65)

The point is that they compare equal in spite of their different multibyte
property:

   (string= "A" (substring "AÄ" 0 1))
   --> t

So, as you said before: "string-equal (and therefore string=) don't ignore
the multibyte property of a string". But it seems that it is not this
property per se that makes the difference, but the differing interpretation
of the strings' contents as a result of this property.

> (On the other hand, one might argue that having both unibyte and
> multibyte strings in a lisp implementation is not a good idea, and
> there's the opportunity for a big refactoring and simplification).
>
> ...

At least it makes it hard to keep the concepts clear.

To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
above. They have the same integer content, only differ in their multibyte
property and compare equal.

If we just change their integer values--in both strings alike--from 65 to
186, we get the pair "\xBA" and (concat '(#xBA)), that we also discussed
before. Also here the only difference lies in the multibyte property, while
the integer values are the same. But this time the strings compare different.

One might say that this is not surprising, because this time the integers are
interpreted as different characters. But this would be in contradiction to
the definition of the term character according to which a character actually
_is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

Have we reached the limit of the definition of what a character is?

But this gets pretty philosophical. For the practical purpose you helped me
a lot and I think that I got some better feeling for this topic.

Thank you very much.

Jürgen


* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-07 13:55   ` [Solved] " Jürgen Hartmann
@ 2015-04-07 14:22     ` Eli Zaretskii
  2015-04-07 17:02       ` Jürgen Hartmann
  0 siblings, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2015-04-07 14:22 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Tue, 7 Apr 2015 15:55:48 +0200
> 
> Thank you Pascal Bourguignon for your explanation:
> 
> > ...
> > 
> >     (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
> >     --> (nil t)
> > 
> > string-equal (and therefore string=) don't ignore the multibyte property
> > of a string.
> 
> So it's all about the multibyte property?

It's about the tricky relationships between unibyte and multibyte
strings.

May I ask why you need to mess with unibyte strings?  (Your original
message doesn't seem to present a real problem, just something that
puzzled you.)

> > Now, it's hard to say how to "solve" this problem, basically, you asked
> > for it: "\xBA" is not a valid way to write a string containing masculine
> > ordinal.
> 
> It seems that one can use "\u00BA" to achieve this in a string constant; it
> evaluates to a multibyte string containing the integer 186:
> 
>    "\u00BA"
>    --> "º"

Why can't you simply use the º character? why do you need to use its
codepoint?

>    (multibyte-string-p "\u00BA")
>    --> t
> 
>    (append "\u00BA" ())
>    --> (186)
> 
> I found it very surprising, that it is not only the escape sequences
> (characters) in the string constant that determine its multibyte property,
> but it is also the other way round: The sequence \x yields
> different results depending on the multibyte property of the string constant
> it is used in. For example the constant "\x3FFFBA" is a unibyte string
> containing the integer 186:
> 
>    "\x3FFFBA"
>    --> "\272"

"Contains" is incorrect here.  That constant _represents_ a raw byte
whose value is 186.  Emacs goes out of its way under the hood to show
you 186 when the buffer or string contains 0x3FFFBA.

> 
>    (multibyte-string-p "\x3FFFBA")
>    --> nil
> 
>    (append "\x3FFFBA" ())
>    --> (186)
> 
> The constant "\x3FFFBA Ä" on the other hand is a multibyte string in which the
> sequence \x3FFFBA yields the integer 4194234:
> 
>    "\x3FFFBA Ä"
>    --> "\272 Ä"
> 
>    (multibyte-string-p "\x3FFFBA Ä")
>    --> t
> 
>    (append "\x3FFFBA Ä" ())
>    --> (4194234 32 196)
> 
> This seems to be an undocumented feature.

It's barely documented in the node "Text Representations" in the ELisp
manual.

This is a tricky issue, so you are well advised to stay away from
unibyte strings as much as you can, for your sanity's sake.

> In this respect it is interesting to compare another pair of strings: "A" and
> (substring "AÄ" 0 1). Both of them contain the same integer, namely 65, and are
> printed as "A"--they only differ in their multibyte property: The former is
> a unibyte string, the latter multibyte:

Don't try to learn about unibyte/multibyte strings using ASCII
characters as examples, because ASCII is treated specially for obvious
reasons.

> > (On the other hand, one might argue that having both unibyte and
> > multibyte strings in a lisp implementation is not a good idea, and
> > there's the opportunity for a big refactoring and simplification).

Hear, hear!

> To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
> above. They have the same integer content, only differ in their multibyte
> property and compare equal.

Yes, and therefore you don't need to consider the multibyte property.

> If we just change their integer values--in both strings alike--from 65 to
> 186, we get the pair "\xBA" and (concat '(#xBA)), that we also discussed
> before. Also here the only difference lies in the multibyte property, while
> the integer values are the same. But this time the strings compare different.

As they should: you are comparing a character with a raw byte.

> One might say that this is not surprising, because this time the integers are
> interpreted as different characters. But this would be in contradiction to
> the definition of the term character according to which a character actually
> _is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

It is an integer, but note that no one told you anywhere that a raw
byte is a character.  It's a raw byte.

> Have we reached the limit of the definition of what a character is?
> 
> But this gets pretty philosophical. For the practical purpose you helped me
> a lot and I think that I got some better feeling for this topic.

I'd still suggest that you try as much as you can not to use unibyte
strings in your Lisp applications.  That way lies madness.

* RE: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-07 14:22     ` Eli Zaretskii
@ 2015-04-07 17:02       ` Jürgen Hartmann
  2015-04-07 17:28         ` Eli Zaretskii
  2015-04-07 18:24         ` Thien-Thi Nguyen
  0 siblings, 2 replies; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-07 17:02 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Thank you for your comments and your caring advice, Eli Zaretskii:

> May I ask why you need to mess with unibyte strings?  (Your original
> message doesn't seem to present a real problem, just something that
> puzzled you.)

That's right: I was trying to learn something about the basic Lisp data types
and their constants and, as a side effect, trying to understand some of these
"cryptic" read and write sequences that one sees in Emacs from time to time.
Doing so it was "\xBA" that unnoticeably lured me into the land of the
unibyte strings. And being there, as you warn below, the confusion started.

First I thought that some hidden decoding based on some charsets or coding
systems occurs. But now--thanks to Pascal Bourguignon and you--I know the
enemy, or at least its name.

>> It seems that one can use "\u00BA" to achieve this in a string constant;
>> it evaluates to a multibyte string containing the integer 186:
>>
>>    "\u00BA"
>>    --> "º"
>
> Why can't you simply use the º character? why do you need to use its
> codepoint?

Of course this would be possible. As said above, the focus here lies in the
rather abstract Lisp topic, namely the conversion of a hex code point to a
string.

>> ... For example the constant "\x3FFFBA" is a unibyte string
>> containing the integer 186:
>>
>>    "\x3FFFBA"
>>    --> "\272"
>
> "Contains" is incorrect here.  That constant _represents_ a raw byte
> whose value is 186.  Emacs goes out of its way under the hood to show
> you 186 when the buffer or string contains 0x3FFFBA.

What is the correct parlance here: Is it correct to say that the constant
"\x3FFFBA\x3FFFBB\x3FFFBC" is not a string because it does not contain (?)
any characters; rather it is just a sequence of raw bytes?

>> ...
>> This seems to be an undocumented feature.
>
> It's barely documented in the node "Text Representations" in the ELisp
> manual.

I knew that, and that the range [#x3FFF80..#x3FFFFF] of code-points is used
for the multibyte representation of raw bytes I learned from section "32.3
Converting Text Representations". My surprise concerning the behavior of
"\x3FFFBA" refers to the fact, that it is a unibyte string--from the sentence
"But beware:..." in section "2.3.8.2 Non-ASCII Characters in Strings" of the
ELisp manual I thought it would be different. (But this was just my faulty
interpretation.)

> This is a tricky issue, so you are well advised to stay away from
> unibyte strings as much as you can, for your sanity's sake.

It was not my fault--"\xBA" is the bad guy.

>> ...
>
> Don't try to learn about unibyte/multibyte strings using ASCII
> characters as examples, because ASCII is treated specially for obvious
> reasons.

Okay.

> ...
>
> Yes, and therefore you don't need to consider the multibyte property.
>
>> ...
>
> As they should: you are comparing a character with a raw byte.
>
>> ... definition of the term character according to which a character
>> actually
>> _is_ that integer (cf. lisp manual, section "2.3.3 Character Type").
>
> It is an integer, but note that no one told you anywhere that a raw
> byte is a character.  It's a raw byte.

Ah, that seems to be the key: raw bytes are not characters. (Up to now I
thought that raw bytes are a special set of characters that have different
representations in unibyte and multibyte contexts.) This distinction removes
all the apparent ambiguities.

In spite of my previous promise not to try to learn something about the
unibyte/multibyte topic from ASCII, I shyly dare to ask another question in
this context (don't beat me): Does the A in the unibyte string "A" represent
a character or a raw byte? Or both? In the latter case, is this that special
treatment of ASCII you talked about before?

> I'd still suggest that you try as much as you can not to use unibyte
> strings in your Lisp applications.  That way lies madness.

I will try to follow that advice--and I hope that it is not too late...

So, thank you very much for your enlightening answers.

Juergen


* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-07 17:02       ` Jürgen Hartmann
@ 2015-04-07 17:28         ` Eli Zaretskii
  2015-04-08 11:01           ` Jürgen Hartmann
  2015-04-07 18:24         ` Thien-Thi Nguyen
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2015-04-07 17:28 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Tue, 7 Apr 2015 19:02:38 +0200
> 
> Thank you for your comments and your caring advice, Eli Zaretskii:
> 
> > May I ask why you need to mess with unibyte strings?  (Your original
> > message doesn't seem to present a real problem, just something that
> > puzzled you.)
> 
> That's right: I was trying to learn something about the basic Lisp data types
> and their constants and, as a side effect, trying to understand some of these
> "cryptic" read and write sequences that one sees in Emacs from time to time.

A worthy goal.

> First I thought that some hidden decoding based on some charsets or coding
> systems occurs.

Actually, some sort of "decoding" does occur, albeit perhaps not in
the use cases you tried -- Emacs will sometimes silently convert
unibyte characters to their locale-dependent multibyte equivalents.

This whole area of unibyte strings is replete with dwim-ish hacks and
kludges, all in an attempt to do what the user expects.  Thus the
confusion and the advice to stay away from that gray area.

> >> ... For example the constant "\x3FFFBA" is a unibyte string
> >> containing the integer 186:
> >>
> >>    "\x3FFFBA"
> >>    --> "\272"
> >
> > "Contains" is incorrect here.  That constant _represents_ a raw byte
> > whose value is 186.  Emacs goes out of its way under the hood to show
> > you 186 when the buffer or string contains 0x3FFFBA.
> 
> What is the correct parlance here: Is it correct to say that the constant
> "\x3FFFBA\x3FFFBB\x3FFFBC" is not a string because it does not contain (?)
> any characters; rather it is just a sequence of raw bytes?

It's a "unibyte string", which, by definition, contains raw bytes.

But it is actually better to say that the raw bytes there are \272 and
not \x3FFFBA.  The latter is just the representation Emacs uses for
the former; Emacs goes out of its way not to show that internal
representation to the user.

> >> ... definition of the term character according to which a character
> >> actually
> >> _is_ that integer (cf. lisp manual, section "2.3.3 Character Type").
> >
> > It is an integer, but note that no one told you anywhere that a raw
> > byte is a character.  It's a raw byte.
> 
> Ah, that seems to be the key: raw bytes are not characters.

Exactly.

> (Up to now I thought that raw bytes are a special set of characters
> that have different representations in unibyte and multibyte
> contexts.)

They _are_ a special "character set", but only in the very technical
sense of "character set" in Emacs.  By their nature and their
properties in Emacs, they are not characters.

> In spite of my previous promise not to try to learn something about the
> unibyte/multibyte topic from ASCII, I shyly dare to ask another question in
> this context (don't beat me): Does the A in the unibyte string "A" represent
> a character or a raw byte? Or both? In the latter case, is this that special
> treatment of ASCII you talked about before?

Raw bytes are only those whose value is above 127, so A is a
character.

For subtle technical reasons (or maybe by some historical accident), a
pure-ASCII string is a unibyte string, although it contains
characters, not raw bytes.  So having a unibyte string does not yet
mean you have raw bytes in it.
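
The ASCII special case can be checked with a small sketch.  The function
name demo-ascii-pair is hypothetical; string-to-multibyte is a built-in
that returns a multibyte copy of a unibyte string.  It builds the same
one-character string in both representations and compares them:

```elisp
;; Hypothetical helper (not an Emacs function): compare "A" as a
;; unibyte string with "A" as a multibyte string.
(defun demo-ascii-pair ()
  "Return (UNI-MULTIBYTE-P MULTI-MULTIBYTE-P EQUAL-P) for the string \"A\"."
  (let ((uni "A")                            ; pure ASCII literal: unibyte
        (multi (string-to-multibyte "A")))   ; same text, multibyte flag set
    (list (multibyte-string-p uni)
          (multibyte-string-p multi)
          (string= uni multi))))

(demo-ascii-pair)  ; => (nil t t)
```

An ASCII character is stored as the same single byte in both
representations, so the two strings compare equal even though their
multibyte flags differ.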

> > I'd still suggest that you try as much as you can not to use unibyte
> > strings in your Lisp applications.  That way lies madness.
> 
> I will try to follow that advice--and I hope that it is not too late...

Practically the only valid use case where you need to manipulate unibyte
strings of raw bytes is if you need to encode or decode strings by
calling encode-coding-region and its ilk.  E.g., an application that
needs to send base64-encoded text needs first to encode it using
whatever coding-system is appropriate, which produces unibyte text
containing raw bytes, and then call base64-encode-region to produce
the final result.  And similarly for decoding such stuff.  You will
see examples of this in Gnus and Rmail, for example.
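
The encode-then-base64 sequence can be sketched on strings instead of
regions.  This is a sketch under assumptions, not how Gnus or Rmail do
it: the helper names demo-base64-utf8 and demo-base64-utf8-decode are
hypothetical, and UTF-8 is chosen only for illustration.  The functions
encode-coding-string, decode-coding-string, base64-encode-string and
base64-decode-string are the string counterparts of the region functions:

```elisp
;; Hypothetical helpers (not Emacs functions).
(defun demo-base64-utf8 (text)
  "Encode TEXT to UTF-8 bytes, then to base64 without line breaks."
  ;; encode-coding-string yields a unibyte string of raw bytes,
  ;; which is what base64-encode-string expects.
  (base64-encode-string (encode-coding-string text 'utf-8) t))

(defun demo-base64-utf8-decode (b64)
  "Decode base64 string B64 to bytes, then decode the bytes as UTF-8."
  (decode-coding-string (base64-decode-string b64) 'utf-8))

(demo-base64-utf8 "\u00BA")        ; => "wro="
(demo-base64-utf8-decode "wro=")   ; => "º"
```

The intermediate unibyte string holds the two bytes \302\272; it exists
only between the two steps and is never treated as text.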

> So, thank you very much for your enlightening answers.

You are welcome.


* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-07 17:02       ` Jürgen Hartmann
  2015-04-07 17:28         ` Eli Zaretskii
@ 2015-04-07 18:24         ` Thien-Thi Nguyen
  2015-04-09 10:40           ` Jürgen Hartmann
  1 sibling, 1 reply; 21+ messages in thread
From: Thien-Thi Nguyen @ 2015-04-07 18:24 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org


() Jürgen Hartmann <juergen_hartmann_@hotmail.com>
() Tue, 7 Apr 2015 19:02:38 +0200

   But now [...] I know the enemy, or at least its name.

If you are referring to unibyte strings, i'd fancy a less
adversarial stance, personally.  Dealing w/ unibyte is like
asking a young child for directions in a strange town.  If
you understand the answers you receive, you may choose to be
charmed.  Otherwise, only lost and frustrated (you can still
choose to be charmed by your frustrations, i suppose -- just
look at all the DENIGRATED-TEXT-EDITOR users out there :-D).

-- 
Thien-Thi Nguyen -----------------------------------------------
  (if you're human and you know it) read my lisp:
    (defun responsep (type via)
      (case type
        (technical (eq 'mailing-list via))
        ...))
---------------------------------------------- GPG key: 4C807502


* Re: Differences between identical strings in Emacs lisp
  2015-04-06 13:21 Jürgen Hartmann
@ 2015-04-07 21:10 ` Stefan Monnier
  2015-04-08 11:02   ` Jürgen Hartmann
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Monnier @ 2015-04-07 21:10 UTC (permalink / raw)
  To: help-gnu-emacs

> both evaluate to (186), indicating that the strings contain the same
> character(s).  So they are identical.

No: the "\xBA" string does not contain any character, it only contains
bytes (we call such "string of bytes" a "unibyte string" and the usual
"string of characters" is called a "multibyte string").

And yes, the (integer) codes of the bytes of "\xBA" happen to be
identical to the (integer) codes of the characters of (concat '(#xBA)).

So (equal (append "\xBA" nil) (append "º" nil)) is non-nil.
Note that the same applies to: (equal (append "\xBA" nil) (append [#xBA] nil))

        Stefan


* RE: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-07 17:28         ` Eli Zaretskii
@ 2015-04-08 11:01           ` Jürgen Hartmann
  2015-04-08 11:59             ` Eli Zaretskii
  0 siblings, 1 reply; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-08 11:01 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Thank you, Eli Zaretskii, for your explanations:

>> [About mapping between unibyte and multibyte strings]
>>
>> First I thought that some hidden decoding based on some charsets or
>> coding systems occurs.
>
> Actually, some sort of "decoding" does occur, albeit perhaps not in
> the use cases you tried -- Emacs will sometimes silently convert
> unibyte characters to their locale-dependent multibyte equivalents.

On which occasions is such a conversion done? Does this have anything to do
with the charset that is individually defined in language-info-alist for
nearly every language environment?

> This whole area of unibyte strings is replete with dwim-ish hacks and
> kludges, all in an attempt to do what the user expects.  Thus the
> confusion and the advice to stay away from that gray area.

Sounds like the well known design conflict between "behaving smart" and
"being straight".

>> [About "\x3FFFBA\x3FFFBB\x3FFFBC"]
>
> It's a "unibyte string", which, by definition, contains raw bytes.
>
> But it is actually better to say that the raw bytes there are \272 and
> not \x3FFFBA.  The latter is just the representation Emacs uses for
> the former; Emacs goes out of its way not to show that internal
> representation to the user.
>
>> ...
>>
>> Ah, that seems to be the key: raw bytes are not characters.
>
> Exactly.

Great! Lesson learned.

>> [About raw bytes]
>
> They _are_ a special "character set", but only in the very technical
> sense of "character set" in Emacs.  By their nature and their
> properties in Emacs, they are not characters.
>
>> [About characters and raw bytes in unibyte context]
>
> Raw bytes are only those whose value is above 127, so A is a
> character.
>
> For subtle technical reasons (or maybe by some historical accident), a
> pure-ASCII string is a unibyte string, although it contains
> characters, not raw bytes.  So having a unibyte string does not yet
> mean you have raw bytes in it.

It seems that all my related observations that puzzled me before can be well
explained by the strict distinction between characters and raw bytes and the
mapping between the latter's integer representations in the range
[0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
multibyte context.
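
This mapping between the two ranges can be sketched as follows.  The
helper name demo-raw-byte-roundtrip is hypothetical; the built-ins
unibyte-char-to-multibyte, multibyte-char-to-unibyte and
string-to-multibyte do the actual conversion:

```elisp
;; Hypothetical helper (not an Emacs function).
(defun demo-raw-byte-roundtrip (byte)
  "Return (MULTIBYTE-CODE BACK) for BYTE in the range #x80..#xFF."
  (let ((code (unibyte-char-to-multibyte byte)))
    (list code (multibyte-char-to-unibyte code))))

(demo-raw-byte-roundtrip #xBA)         ; => (4194234 186), i.e. (#x3FFFBA #xBA)
(aref (string-to-multibyte "\xBA") 0)  ; => 4194234
(- #x3FFFBA #xBA)                      ; => 4194048, i.e. #x3FFF00
```

The two codes differ by the constant offset #x3FFF00, which is what
moves [0x80..0xFF] onto [0x3FFF80..0x3FFFFF].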

> By far the only valid use case where you need to manipulate unibyte
> strings of raw bytes is if you need to encode or decode strings by
> calling encode-coding-region and its ilk.  E.g., an application that
> needs to send base64-encoded text needs first to encode it using
> whatever coding-system is appropriate, which produces unibyte text
> containing raw bytes, and then call base64-encode-region to produce
> the final result.  And similarly for decoding such stuff.  You will
> see examples of this in Gnus and Rmail, for example.
>
>> So, thank you very much for your enlightening answers.
>
> You are welcome.

Thank you very much.

Juergen

* RE: Differences between identical strings in Emacs lisp
  2015-04-07 21:10 ` Stefan Monnier
@ 2015-04-08 11:02   ` Jürgen Hartmann
  2015-04-08 13:44     ` Jürgen Hartmann
  0 siblings, 1 reply; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-08 11:02 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

@Stefan Monnier: Thank you for your clarification:

>> both evaluate to (186), indicating that the strings contain the same
>> character(s).  So they are identical.
>
> No: the "\xBA" string does not contain any character, it only contains
> bytes (we call such "string of bytes" a "unibyte string" and the usual
> "string of characters" is called a "multibyte string").

That's very important, indeed--according to the golden rule:
"Clarity of concept requires clarity of terms."

> And yes, the (integer) codes of the bytes of "\xBA" happen to be
> identical to the (integer) codes of the characters of (concat '(#xBA)).
>
> So (equal (append "\xBA" nil) (append "º" nil)) is non-nil.
> Note that the same applies to: (equal (append "\xBA" nil) (append [#xBA] nil))

I think my problem was that I have missed the type--unibyte vs. multibyte--of
my strings, the fact that characters and raw bytes are different things, and
that the (integer) codes of raw bytes get converted between unibyte and
multibyte contexts. Because of the latter we have equality for example
between "\xBA" and (concat '(#x3FFFBA)):

   (string= "\xBA" (concat '(#x3FFFBA)))
   --> t

Again, thank you for your input.

Juergen

* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-08 11:01           ` Jürgen Hartmann
@ 2015-04-08 11:59             ` Eli Zaretskii
  2015-04-08 12:37               ` Stefan Monnier
  2015-04-09 10:36               ` Jürgen Hartmann
  0 siblings, 2 replies; 21+ messages in thread
From: Eli Zaretskii @ 2015-04-08 11:59 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Wed, 8 Apr 2015 13:01:16 +0200
> 
> >> [About mapping between unibyte and multibyte strings]
> >>
> >> First I thought that some hidden decoding based on some charsets or
> >> coding
> >> systems occurs.
> >
> > Actually, some sort of "decoding" does occur, albeit perhaps not in
> > the use cases you tried -- Emacs will sometimes silently convert
> > unibyte characters to their locale-dependent multibyte equivalents.
> 
> On which occasion such a conversion is done?

One example that comes to mind is (insert 160), i.e. when inserting
text into a buffer.  There are other examples, but I simply don't
remember them at the moment.

> Has this anything to do with the charset that is individually
> defined in language-info-alist for nearly each language environment?

No, I think Emacs converts the value to the character that has the
same Unicode codepoint.

> It seems that all my related observations that puzzled me before can be well
> explained by the strict distinction between characters and raw bytes and the
> mapping between the latter's integer representations in the range
> [0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
> multibyte context.

Pretty much, yes.
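
The mapping between the two ranges can be checked directly. This is only a
minimal sketch, assuming Emacs 23 or later; the byte #xBA is picked purely
as an illustration, while unibyte-char-to-multibyte and
multibyte-char-to-unibyte are the standard conversion functions:

```elisp
;; Raw byte #xBA as seen from a unibyte and from a multibyte context.
(unibyte-char-to-multibyte #xBA)           ; => 4194234, i.e. #x3FFFBA
(multibyte-char-to-unibyte #x3FFFBA)       ; => 186, i.e. #xBA

;; The character with code #xBA (MASCULINE ORDINAL INDICATOR) is not the
;; same thing as the raw byte #xBA:
(= #xBA (unibyte-char-to-multibyte #xBA))  ; => nil
```

ASCII codes (below #x80) are left unchanged by both functions.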




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-08 11:59             ` Eli Zaretskii
@ 2015-04-08 12:37               ` Stefan Monnier
  2015-04-09 10:38                 ` Jürgen Hartmann
  2015-04-09 10:36               ` Jürgen Hartmann
  1 sibling, 1 reply; 21+ messages in thread
From: Stefan Monnier @ 2015-04-08 12:37 UTC (permalink / raw)
  To: help-gnu-emacs

>> > the use cases you tried -- Emacs will sometimes silently convert
>> > unibyte characters to their locale-dependent multibyte equivalents.

Nowadays this should happen extremely rarely, or never.

>> On which occasion such a conversion is done?
> One example that comes to mind is (insert 160), i.e. when inserting
> text into a buffer.

This doesn't do any conversion (although it did, in Emacs<23).
160 is simply taken as the code of the corresponding character in
Emacs's character space (which is basically Unicode), hence regardless
of locale.

If this `insert' is performed inside a unibyte buffer, then this 160 is
instead taken to be the code of a byte.  Again, regardless of the locale.
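
A small sketch of the two cases, assuming Emacs 23 or later, where temporary
buffers are multibyte by default. The use of position-bytes to show how many
bytes were stored is just one way to make the difference visible:

```elisp
;; Multibyte buffer (the default): 160 is the character U+00A0,
;; stored internally as two bytes.
(with-temp-buffer
  (insert 160)
  (list (char-after (point-min))          ; code of what was inserted
        (position-bytes (point-max))))    ; 1 + number of bytes stored
;; => (160 3)

;; Unibyte buffer: 160 is stored as a single byte.
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert 160)
  (list (char-after (point-min))
        (position-bytes (point-max))))
;; => (160 2)
```

In both cases the result does not depend on the locale.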

AFAIR, the only "dwimish" conversion that still takes place on occasion
is between things like #x3FFFBA and #xBA (i.e. between a byte and
a character representing that same byte).

>> It seems that all my related observations that puzzled me before can be well
>> explained by the strict distinction between characters and raw bytes and the
>> mapping between the latter's integer representations in the range
>> [0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
>> multibyte context.
> Pretty much, yes.

Yes, distinguishing bytes (and byte strings/buffers) from chars (and
char strings/buffers) is key.  Sadly, Emacs doesn't make it easy because
the terms used evolved from a time where byte=char and where people were
focused too much on the underlying/internal representation (hence the
terms "multibyte" vs "unibyte"), plus the fact that too much code relied
on byte=char to be able to make a clean design.  So when Emacs-20
appeared, it included all kinds of dwimish (and locale-dependent)
conversions to try and accommodate incorrect byte=char assumptions.
Over time, the design has been significantly cleaned up, but the
terminology is still problematic.

        Stefan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Differences between identical strings in Emacs lisp
  2015-04-08 11:02   ` Jürgen Hartmann
@ 2015-04-08 13:44     ` Jürgen Hartmann
  0 siblings, 0 replies; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-08 13:44 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Argh! Writing about it, I made the same mistake again.

Please forget the wrong example in my previous post:

> Because of the latter we have equality for example
> between "\xBA" and (concat '(#x3FFFBA)):
>
>    (string= "\xBA" (concat '(#x3FFFBA)))
>    --> t

Of course "\xBA" and "\x3FFFBA" both represent the same raw byte \272, and
both yield a unibyte string. Therefore they are trivially equal.
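
For cases like this, multibyte-string-p shows which kind of string an
expression produces. A minimal sketch, assuming Emacs 23 or later and using
the strings from the start of this thread:

```elisp
(multibyte-string-p "\xBA")            ; => nil  (unibyte: one raw byte)
(multibyte-string-p (concat '(#xBA)))  ; => t    (multibyte: the character º)
(string= "\xBA" (concat '(#xBA)))      ; => nil

;; Converting the unibyte string to multibyte yields the raw-byte
;; character, not the character º:
(aref (string-to-multibyte "\xBA") 0)  ; => 4194234, i.e. #x3FFFBA
```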

And what makes it even more embarrassing: I already wrote it right in another
post before.

Sorry.

Juergen

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-08 11:59             ` Eli Zaretskii
  2015-04-08 12:37               ` Stefan Monnier
@ 2015-04-09 10:36               ` Jürgen Hartmann
  1 sibling, 0 replies; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-09 10:36 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Thank you, Eli Zaretskii, for your answer:

>> [About mapping between unibyte and multibyte strings]
>>
>> On which occasion such a conversion is done?
>
> One example that comes to mind is (insert 160), i.e. when inserting
> text into a buffer.  There are other examples, but I simply don't
> remember them at the moment.
>
>> Has this anything to do with the charset that is individually
>> defined in language-info-alist for nearly each language environment?
>
> No, I think Emacs converts the value to the character that has the
> same Unicode codepoint.

I see.

>> It seems that all my related observations that puzzled me before can be well
>> explained by the strict distinction between characters and raw bytes and the
>> mapping between the latter's integer representations in the range
>> [0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
>> multibyte context.
>
> Pretty much, yes.

Thank you for the confirmation.

Juergen



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-08 12:37               ` Stefan Monnier
@ 2015-04-09 10:38                 ` Jürgen Hartmann
  2015-04-09 12:32                   ` Stefan Monnier
  2015-04-09 12:45                   ` Eli Zaretskii
  0 siblings, 2 replies; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-09 10:38 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Thank you for the clarification, Stefan Monnier:

>>>> the use cases you tried -- Emacs will sometimes silently convert
>>>> unibyte characters to their locale-dependent multibyte equivalents.
>
> Nowadays this should happen extremely rarely, or never.
>
>>> On which occasion such a conversion is done?
>> One example that comes to mind is (insert 160), i.e. when inserting
>> text into a buffer.
>
> This doesn't do any conversion (although it did, in Emacs<23).
> 160 is simply taken as the code of the corresponding character in
> Emacs's character space (which is basically Unicode), hence regardless
> of locale.
>
> If this `insert' is performed inside a unibyte buffer, then this 160 is
> instead taken to be the code of a byte.  Again, regardless of the locale.

So this is comparable to the output of \xA0 in a unibyte string
(e.g. in "\xA0\ A") in contrast to the same in a multibyte string (e.g. in
"\xA0 Ä"): The former yields the raw byte \240, the latter a no-break space.

> AFAIR, the only "dwimish" conversion that still takes place on occasion
> is between things like #x3FFFBA and #xBA (i.e. between a byte and
> a character representing that same byte).

(*Broad grin*) I think I will appoint this one my favorite trap. (See my
previous post.)

>>> It seems that all my related observations that puzzled me before can be well
>>> explained by the strict distinction between characters and raw bytes and the
>>> mapping between the latter's integer representations in the range
>>> [0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
>>> multibyte context.
>> Pretty much, yes.
>
> Yes, distinguishing bytes (and byte strings/buffers) from chars (and
> char strings/buffers) is key.  Sadly, Emacs doesn't make it easy because
> the terms used evolved from a time where byte=char and where people were
> focused too much on the underlying/internal representation (hence the
> terms "multibyte" vs "unibyte"), plus the fact that too much code relied
> on byte=char to be able to make a clean design.  So when Emacs-20
> appeared, it included all kinds of dwimish (and locale-dependent)
> conversions to try and accommodate incorrect byte=char assumptions.
> Over time, the design has been significantly cleaned up, but the
> terminology is still problematic.

I could imagine that the step from the equivalence char=byte to
char=Unicode code point (long(er) integer) is not so difficult. But we have
in addition the UTF-8 representation. To which of the latter two--the Unicode
code point (integer, several bytes long) or its UTF-8 representation (sequence
of several bytes)--does the term "multibyte" refer?

Thank you for the insight into the historical background.

Juergen



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-07 18:24         ` Thien-Thi Nguyen
@ 2015-04-09 10:40           ` Jürgen Hartmann
  0 siblings, 0 replies; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-09 10:40 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Thank you, Thien-Thi Nguyen, for your soothing words:

>> But now [...] I know the enemy, or at least its name.
>
> If you are referring to unibyte strings, i'd fancy a less
> adversarial stance, personally.

Right you are: One should always believe in the true goodness and virtue of
all kinds of strings and think positively. :-)

> Dealing w/ unibyte is like
> asking a young child for directions in a strange town.  If
> you understand the answers you receive, you may choose to be
> charmed.

So you mean that one might attain mild happiness and innocent luck from the
interaction with unibyte objects? I am definitely looking forward to it.

> Otherwise, only lost and frustrated (you can still
> choose to be charmed by your frustrations, i suppose -- just
> look at all the DENIGRATED-TEXT-EDITOR users out there :-D).

But also those benefit from THE-ONE-EDITOR: It establishes the contrast.

But seriously: From the profound advice I received from all of you I learned
that there are some traps related to this topic that have to be avoided or
handled with appropriate caution.

So thank you all very much again.

Juergen

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-09 10:38                 ` Jürgen Hartmann
@ 2015-04-09 12:32                   ` Stefan Monnier
  2015-04-09 12:45                   ` Eli Zaretskii
  1 sibling, 0 replies; 21+ messages in thread
From: Stefan Monnier @ 2015-04-09 12:32 UTC (permalink / raw)
  To: help-gnu-emacs

> I could imagine that the step from the equivalence char=byte to
> char=unicode code point (long(er) integer) is not so difficult. But we have
> in addition the UTF-8 representation. To what of the two latter--unicode code
> point (integer, several bytes long) or its UTF-8 representation (sequence of
> several bytes) does the term "multibyte" refer?

multibyte refers to "string of characters".  These have been represented
internally using an iso-2022 encoding until Emacs-22 and since Emacs-23
they're represented internally with a utf-8 encoding.  The name comes
from the fact that each element can use up more than one byte.
But that's just an internal detail that is mostly hidden from Elisp.

To turn such a string of characters into a string of bytes you need to
use things like encode-coding-(string|buffer), at which point you have
to specify which encoding you want to use (e.g. utf-8).
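
As a rough sketch of that step (assuming Emacs 23 or later; the character
#xBA is just the example from this thread), encoding turns a string of
characters into a string of bytes, and decoding reverses it:

```elisp
(let* ((chars (string #xBA))                          ; multibyte string "º"
       (bytes (encode-coding-string chars 'utf-8)))   ; unibyte string
  (list (multibyte-string-p chars)
        (multibyte-string-p bytes)
        (append bytes nil)                            ; the individual bytes
        (string= chars (decode-coding-string bytes 'utf-8))))
;; => (t nil (194 186) t)
```

The bytes 194 and 186 are #xC2 and #xBA, the same "file code" that
describe-char reports for this character.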

        Stefan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-09 10:38                 ` Jürgen Hartmann
  2015-04-09 12:32                   ` Stefan Monnier
@ 2015-04-09 12:45                   ` Eli Zaretskii
  2015-04-10  2:35                     ` Richard Wordingham
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2015-04-09 12:45 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Thu, 9 Apr 2015 12:38:43 +0200
> 
> > If this `insert' is performed inside a unibyte buffer, then this 160 is
> > instead taken to be the code of a byte.  Again, regardless of the locale.
> 
> So this is comparable to the output of \xA0 in a unibyte string
> (e.g. in "\xA0\ A") in contrast to the same in a multibyte string (e.g. in
> "\xA0 Ä"): The former yields the raw byte \240, the latter a no-break space.

Yes, Emacs tries to treat buffers and strings alike.

> I could imagine that the step from the equivalence char=byte to
> char=unicode code point (long(er) integer) is not so difficult.

The problem with this is that an encoded character could span several
bytes, and then what do you call each byte of such a multibyte
sequence?  You cannot call it a character.

> But we have in addition the UTF-8 representation.

If you mean the internal representation, then it's a superset of
UTF-8, not UTF-8.  If you mean the external encoding of text, then
UTF-8 is not the only representation, not even the only multibyte
representation.  There are others, mostly used in Far East, but not
only there.  Even UTF-16, used natively by MS-Windows, is technically
a multibyte representation.

> To what of the two latter--unicode code point (integer, several
> bytes long) or its UTF-8 representation (sequence of several bytes)
> does the term "multibyte" refer?

In the context of Emacs, it refers to the internal representation of
characters, which is a superset of UTF-8.
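
The difference between the number of characters and the number of bytes in
the internal representation can be observed with length and string-bytes.
A minimal sketch, assuming Emacs 23 or later:

```elisp
;; The character º: one character, two bytes internally (#xC2 #xBA,
;; the same as its UTF-8 encoding).
(let ((s (string #xBA)))
  (list (length s) (string-bytes s)))
;; => (1 2)

;; A raw byte held in a multibyte string also takes two bytes internally,
;; but as a sequence that is not valid UTF-8.  This is one of the places
;; where the internal representation goes beyond UTF-8.
(let ((s (string-to-multibyte "\xBA")))
  (list (length s) (string-bytes s)))
;; => (1 2)
```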

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-09 12:45                   ` Eli Zaretskii
@ 2015-04-10  2:35                     ` Richard Wordingham
  2015-04-10  4:46                       ` Stefan Monnier
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Wordingham @ 2015-04-10  2:35 UTC (permalink / raw)
  To: help-gnu-emacs

On Thu, 09 Apr 2015 15:45:06 +0300
Eli Zaretskii <eliz@gnu.org> wrote:

> Even UTF-16, used natively by MS-Windows, is technically
> a multibyte representation.

UTF-16 is a multiword representation (Cuneiform takes 4 bytes per
character), so there's no 'technically' about it.

Richard.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-10  2:35                     ` Richard Wordingham
@ 2015-04-10  4:46                       ` Stefan Monnier
  2015-04-10 12:24                         ` Jürgen Hartmann
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Monnier @ 2015-04-10  4:46 UTC (permalink / raw)
  To: help-gnu-emacs

>> Even UTF-16, used natively by MS-Windows, is technically
>> a multibyte representation.
> UTF-16 is a multiword representation (Cuneiform takes 4 bytes per
> character), so there's no 'technically' about it.

Even the simplest chars like `a' are represented as "multiple bytes", so
yes, it's definitely a multibyte representation.
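
This is easy to see by counting the bytes of the encoded text. A small
sketch, assuming Emacs 23 or later; utf-16le is the coding system without a
byte-order mark, and U+12000 is picked here as a sample Cuneiform character:

```elisp
(length (encode-coding-string "a" 'utf-16le))               ; => 2
(length (encode-coding-string (string #x12000) 'utf-16le))  ; => 4
(length (encode-coding-string "a" 'utf-8))                  ; => 1
```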


        Stefan




^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Solved] RE: Differences between identical strings in Emacs lisp
  2015-04-10  4:46                       ` Stefan Monnier
@ 2015-04-10 12:24                         ` Jürgen Hartmann
  0 siblings, 0 replies; 21+ messages in thread
From: Jürgen Hartmann @ 2015-04-10 12:24 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Thank you all for the profound explanations and comments, and the very
informative glances under the hood.

Juergen



^ permalink raw reply	[flat|nested] 21+ messages in thread
