* Re: Differences between identical strings in Emacs lisp
From: Pascal J. Bourguignon @ 2015-04-07 0:10 UTC
To: help-gnu-emacs

Jürgen Hartmann <juergen_hartmann_@hotmail.com> writes:

> What is the difference between the string represented by the constant "\xBA"
> and the result of (concat '(#xBA))?

  (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
  --> (nil t)

string-equal (and therefore string=) don't ignore the multibyte property
of a string.

You can use:

  (mapcar 'string-as-unibyte (list "\xBA" (concat '(#xBA))))
  --> ("\272" "\302\272")

to see the difference.

Now, it's hard to say how to "solve" this problem; basically, you asked
for it: "\xBA" is not a valid way to write a string containing the
masculine ordinal.

I guess you could extract back the bytes, and recreate the string
correctly:

  (map 'string 'identity (map 'list 'identity "\xBA"))
  --> "º"

  (string= (map 'string 'identity (map 'list 'identity "\xBA"))
           (concat '(#xBA)))
  --> t

(On the other hand, one might argue that having both unibyte and
multibyte strings in a lisp implementation is not a good idea, and
there's the opportunity for a big refactoring and simplification).

--
__Pascal Bourguignon__ http://www.informatimago.com/
“The factory of the future will have only two employees, a man and a
dog. The man will be there to feed the dog. The dog will be there to
keep the man from touching the equipment.” -- Carl Bass CEO Autodesk
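The round trip above relies on map from the old cl package. As a minimal sketch of the same idea using only built-in functions, assuming Emacs 23 or later (the helper name my-bytes-as-chars is made up for illustration and does not appear in the thread):

```elisp
;; Hypothetical helper: reinterpret each byte of a unibyte string as the
;; character with the same code, i.e. read the bytes as if they were Latin-1.
;; Intended for unibyte input only.
(defun my-bytes-as-chars (s)
  "Return a string whose characters have the codes of the bytes in S."
  (concat (append s nil)))   ; string -> list of integers -> string

(multibyte-string-p "\xBA")                      ; => nil (unibyte literal)
(multibyte-string-p (my-bytes-as-chars "\xBA"))  ; => t   (rebuilt as characters)
(string= (my-bytes-as-chars "\xBA") "\u00BA")    ; => t
```

This only makes sense when the bytes really are meant as Latin-1 text; for any other encoding, decoding with an explicit coding system is the safer route.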
* [Solved] RE: Differences between identical strings in Emacs lisp
From: Jürgen Hartmann @ 2015-04-07 13:55 UTC
To: help-gnu-emacs@gnu.org

Thank you Pascal Bourguignon for your explanation:

> ...
>
>   (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
>   --> (nil t)
>
> string-equal (and therefore string=) don't ignore the multibyte property
> of a string.

So it's all about the multibyte property?

> You can use:
>
>   (mapcar 'string-as-unibyte (list "\xBA" (concat '(#xBA))))
>   --> ("\272" "\302\272")
>
> to see the difference.

I see: "\xBA" stays as it is--a unibyte string containing the raw
character \272--, while the multibyte string (concat '(#xBA)) gets
converted to its UTF-8 unibyte form.

> Now, it's hard to say how to "solve" this problem; basically, you asked
> for it: "\xBA" is not a valid way to write a string containing the
> masculine ordinal.

It seems that one can use "\u00BA" to achieve this in a string constant; it
evaluates to a multibyte string containing the integer 186:

  "\u00BA"
  --> "º"

  (multibyte-string-p "\u00BA")
  --> t

  (append "\u00BA" ())
  --> (186)

I found it very surprising that it is not only the escape sequences
(characters) in the string constant that determine its multibyte property,
but also the other way round: the sequence \x yields different results
depending on the multibyte property of the string constant it is used in.
For example, the constant "\x3FFFBA" is a unibyte string containing the
integer 186:

  "\x3FFFBA"
  --> "\272"

  (multibyte-string-p "\x3FFFBA")
  --> nil

  (append "\x3FFFBA" ())
  --> (186)

The constant "\x3FFFBA Ä" on the other hand is a multibyte string in which
the sequence \x3FFFBA yields the integer 4194234:

  "\x3FFFBA Ä"
  --> "\272 Ä"

  (multibyte-string-p "\x3FFFBA Ä")
  --> t

  (append "\x3FFFBA Ä" ())
  --> (4194234 32 196)

This seems to be an undocumented feature.

> I guess you could extract back the bytes, and recreate the string
> correctly:
>
>   (map 'string 'identity (map 'list 'identity "\xBA"))
>   --> "º"
>
>   (string= (map 'string 'identity (map 'list 'identity "\xBA"))
>            (concat '(#xBA)))
>   --> t

So reassembling the string by means of map 'string results in a string
containing the same integer as "\xBA", namely 186, but as a multibyte
string and with the corresponding interpretation of its contents?

In this respect it is interesting to compare another pair of strings: "A"
and (substring "AÄ" 0 1). Both of them contain the same integer, namely 65,
and are printed as "A"--they only differ in their multibyte property: the
former is a unibyte string, the latter multibyte:

  "A"
  --> "A"

  (multibyte-string-p "A")
  --> nil

  (append "A" ())
  --> (65)

and

  (substring "AÄ" 0 1)
  --> "A"

  (multibyte-string-p (substring "AÄ" 0 1))
  --> t

  (append (substring "AÄ" 0 1) ())
  --> (65)

The point is that they compare equal in spite of their different multibyte
property:

  (string= "A" (substring "AÄ" 0 1))
  --> t

So, as you said before: "string-equal (and therefore string=) don't ignore
the multibyte property of a string". But it seems that it is not this
property per se that makes the difference, but the differing interpretation
of the strings' contents as a result of this property.

> (On the other hand, one might argue that having both unibyte and
> multibyte strings in a lisp implementation is not a good idea, and
> there's the opportunity for a big refactoring and simplification).
>
> ...
At least it makes it hard to keep the concepts clear.

To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
above. They have the same integer content, only differ in their multibyte
property, and compare equal.

If we just change their integer values--in both strings alike--from 65 to
186, we get the pair "\xBA" and (concat '(#xBA)) that we also discussed
before. Here, too, the only difference lies in the multibyte property, while
the integer values are the same. But this time the strings compare different.

One might say that this is not surprising, because this time the integers are
interpreted as different characters. But this would be in contradiction to
the definition of the term character according to which a character actually
_is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

Do we come to the limit of the definition of what a character is?

But this gets pretty philosophical. For the practical purpose you helped me
a lot and I think that I got some better feeling for this topic.

Thank you very much.

Jürgen
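One way to look at the two pairs is to convert every string to multibyte form before listing its codes. The following is a minimal sketch, assuming Emacs 23 or later; the helper name my-codes is hypothetical. It relies on string-to-multibyte, which leaves ASCII alone and turns each byte of #x80 or above into a code in the range #x3FFF80..#x3FFFFF.

```elisp
;; Hypothetical helper: list the codes of S after conversion to multibyte.
;; A multibyte argument is returned unchanged by `string-to-multibyte'.
(defun my-codes (s)
  "Return the list of character codes of S in multibyte form."
  (append (string-to-multibyte s) nil))

(my-codes "A")                          ; => (65)
(my-codes (substring "A\u00C4" 0 1))    ; => (65)
(my-codes "\xBA")                       ; => (4194234), i.e. #x3FFFBA
(my-codes (concat '(#xBA)))             ; => (186)
```

Seen this way, the first pair holds the same code in both strings, while the second pair does not, which matches the results of string= reported above.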
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Eli Zaretskii @ 2015-04-07 14:22 UTC
To: help-gnu-emacs

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Tue, 7 Apr 2015 15:55:48 +0200
>
> Thank you Pascal Bourguignon for your explanation:
>
> > ...
> >
> >   (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
> >   --> (nil t)
> >
> > string-equal (and therefore string=) don't ignore the multibyte property
> > of a string.
>
> So it's all about the multibyte property?

It's about the tricky relationships between unibyte and multibyte
strings.

May I ask why you need to mess with unibyte strings? (Your original
message doesn't seem to present a real problem, just something that
puzzled you.)

> > Now, it's hard to say how to "solve" this problem; basically, you asked
> > for it: "\xBA" is not a valid way to write a string containing the
> > masculine ordinal.
>
> It seems that one can use "\u00BA" to achieve this in a string constant; it
> evaluates to a multibyte string containing the integer 186:
>
>   "\u00BA"
>   --> "º"

Why can't you simply use the º character? Why do you need to use its
codepoint?

>   (multibyte-string-p "\u00BA")
>   --> t
>
>   (append "\u00BA" ())
>   --> (186)
>
> I found it very surprising that it is not only the escape sequences
> (characters) in the string constant that determine its multibyte property,
> but also the other way round: the sequence \x yields different results
> depending on the multibyte property of the string constant it is used in.
> For example, the constant "\x3FFFBA" is a unibyte string containing the
> integer 186:
>
>   "\x3FFFBA"
>   --> "\272"

"Contains" is incorrect here. That constant _represents_ a raw byte
whose value is 186.
Emacs goes out of its way under the hood to show you 186 when the
buffer or string contains 0x3FFFBA.

>   (multibyte-string-p "\x3FFFBA")
>   --> nil
>
>   (append "\x3FFFBA" ())
>   --> (186)
>
> The constant "\x3FFFBA Ä" on the other hand is a multibyte string in which
> the sequence \x3FFFBA yields the integer 4194234:
>
>   "\x3FFFBA Ä"
>   --> "\272 Ä"
>
>   (multibyte-string-p "\x3FFFBA Ä")
>   --> t
>
>   (append "\x3FFFBA Ä" ())
>   --> (4194234 32 196)
>
> This seems to be an undocumented feature.

It's barely documented in the node "Text Representations" in the ELisp
manual.

This is a tricky issue, so you are well advised to stay away from
unibyte strings as much as you can, for your sanity's sake.

> In this respect it is interesting to compare another pair of strings: "A"
> and (substring "AÄ" 0 1). Both of them contain the same integer, namely 65,
> and are printed as "A"--they only differ in their multibyte property: the
> former is a unibyte string, the latter multibyte:

Don't try to learn about unibyte/multibyte strings using ASCII
characters as examples, because ASCII is treated specially for obvious
reasons.

> > (On the other hand, one might argue that having both unibyte and
> > multibyte strings in a lisp implementation is not a good idea, and
> > there's the opportunity for a big refactoring and simplification).

Hear, hear!

> To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
> above. They have the same integer content, only differ in their multibyte
> property, and compare equal.

Yes, and therefore you don't need to consider the multibyte property.

> If we just change their integer values--in both strings alike--from 65 to
> 186, we get the pair "\xBA" and (concat '(#xBA)) that we also discussed
> before. Here, too, the only difference lies in the multibyte property, while
> the integer values are the same. But this time the strings compare different.

As they should: you are comparing a character with a raw byte.
> One might say that this is not surprising, because this time the integers are
> interpreted as different characters. But this would be in contradiction to
> the definition of the term character according to which a character actually
> _is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

It is an integer, but note that no one told you anywhere that a raw
byte is a character. It's a raw byte.

> Do we come to the limit of the definition of what a character is?
>
> But this gets pretty philosophical. For the practical purpose you helped me
> a lot and I think that I got some better feeling for this topic.

I'd still suggest that you try as much as you can not to use unibyte
strings in your Lisp applications. That way lies madness.
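The distinction between the character 186 and the raw byte 186 can be made visible with the two built-in conversion functions. This is a minimal sketch, assuming Emacs 23 or later; the constant name my-raw-byte is hypothetical.

```elisp
;; Hypothetical name: the multibyte code that stands for raw byte #xBA.
(defconst my-raw-byte (unibyte-char-to-multibyte #xBA)
  "Code used in multibyte text to represent the raw byte #xBA.")

my-raw-byte                               ; => 4194234 (#x3FFFBA)
(multibyte-char-to-unibyte my-raw-byte)   ; => 186
(= my-raw-byte #xBA)                      ; => nil: not the character U+00BA
```

So the value 186 seen in a unibyte string and the character code 186 seen in a multibyte string are different things once both are expressed in the same representation.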
* RE: [Solved] RE: Differences between identical strings in Emacs lisp
From: Jürgen Hartmann @ 2015-04-07 17:02 UTC
To: help-gnu-emacs@gnu.org

Thank you for your comments and your caring advice, Eli Zaretskii:

> May I ask why you need to mess with unibyte strings? (Your original
> message doesn't seem to present a real problem, just something that
> puzzled you.)

That's right: I was trying to learn something about the basic Lisp data types
and their constants and, as a side effect, trying to understand some of these
"cryptic" read and write sequences that one sees in Emacs from time to time.

Doing so, it was "\xBA" that unnoticeably lured me into the land of the
unicode strings. And being there, as you warn below, the confusion started.

First I thought that some hidden decoding based on some charsets or coding
systems occurs. But now--thanks to Pascal Bourguignon and you--I know the
enemy, or at least its name.

>> It seems that one can use "\u00BA" to achieve this in a string constant;
>> it evaluates to a multibyte string containing the integer 186:
>>
>>   "\u00BA"
>>   --> "º"
>
> Why can't you simply use the º character? Why do you need to use its
> codepoint?

Of course this would be possible. As said above, the focus here lies on the
rather abstract Lisp topic, namely the conversion of a hex code point to a
string.

>> ... For example, the constant "\x3FFFBA" is a unibyte string
>> containing the integer 186:
>>
>>   "\x3FFFBA"
>>   --> "\272"
>
> "Contains" is incorrect here. That constant _represents_ a raw byte
> whose value is 186. Emacs goes out of its way under the hood to show
> you 186 when the buffer or string contains 0x3FFFBA.
What is the correct parlance here: Is it correct to say that the constant
"\x3FFFBA\x3FFFBB\x3FFFBC" is not a string because it does not contain (?)
any characters; rather it is just a sequence of raw bytes?

>> ...
>> This seems to be an undocumented feature.
>
> It's barely documented in the node "Text Representations" in the ELisp
> manual.

I knew that, and I learned from section "32.3 Converting Text
Representations" that the range [#x3FFF80..#x3FFFFF] of code points is used
for the multibyte representation of raw bytes.

My surprise concerning the behavior of "\x3FFFBA" refers to the fact that it
is a unibyte string; from the sentence "But beware:..." in section "2.3.8.2
Non-ASCII Characters in Strings" of the ELisp manual I thought it would be
different. (But this was just my faulty interpretation.)

> This is a tricky issue, so you are well advised to stay away from
> unibyte strings as much as you can, for your sanity's sake.

It was not my fault--"\xBA" is the bad guy.

>> ...
>
> Don't try to learn about unibyte/multibyte strings using ASCII
> characters as examples, because ASCII is treated specially for obvious
> reasons.

Okay.

> ...
>
> Yes, and therefore you don't need to consider the multibyte property.
>
>> ...
>
> As they should: you are comparing a character with a raw byte.
>
>> ... definition of the term character according to which a character
>> actually _is_ that integer (cf. lisp manual, section "2.3.3 Character
>> Type").
>
> It is an integer, but note that no one told you anywhere that a raw
> byte is a character. It's a raw byte.

Ah, that seems to be the key: raw bytes are not characters. (Up to now I
thought that raw bytes are a special set of characters that have different
representations in unibyte and multibyte contexts.) This distinction removes
all the apparent ambiguities.
In spite of my previous promise not to try to learn something about the
unibyte/multibyte topic from ASCII, I shyly dare to ask another question in
this context (don't beat me): Does the A in the unibyte string "A" represent
a character or a raw byte? Or both? In the latter case, is this the special
treatment of ASCII you talked about before?

> I'd still suggest that you try as much as you can not to use unibyte
> strings in your Lisp applications. That way lies madness.

I will try to follow that advice--and I hope that it is not too late...

So, thank you very much for your enlightening answers.

Juergen
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Eli Zaretskii @ 2015-04-07 17:28 UTC
To: help-gnu-emacs

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Tue, 7 Apr 2015 19:02:38 +0200
>
> Thank you for your comments and your caring advice, Eli Zaretskii:
>
> > May I ask why you need to mess with unibyte strings? (Your original
> > message doesn't seem to present a real problem, just something that
> > puzzled you.)
>
> That's right: I was trying to learn something about the basic Lisp data types
> and their constants and, as a side effect, trying to understand some of these
> "cryptic" read and write sequences that one sees in Emacs from time to time.

A worthy goal.

> First I thought that some hidden decoding based on some charsets or coding
> systems occurs.

Actually, some sort of "decoding" does occur, albeit perhaps not in
the use cases you tried -- Emacs will sometimes silently convert
unibyte characters to their locale-dependent multibyte equivalents.

This whole area of unibyte strings is replete with dwim-ish hacks and
kludges, all in an attempt to do what the user expects. Thus the
confusion and the advice to stay away from that gray area.

> >> ... For example, the constant "\x3FFFBA" is a unibyte string
> >> containing the integer 186:
> >>
> >>   "\x3FFFBA"
> >>   --> "\272"
> >
> > "Contains" is incorrect here. That constant _represents_ a raw byte
> > whose value is 186. Emacs goes out of its way under the hood to show
> > you 186 when the buffer or string contains 0x3FFFBA.
>
> What is the correct parlance here: Is it correct to say that the constant
> "\x3FFFBA\x3FFFBB\x3FFFBC" is not a string because it does not contain (?)
> any characters; rather it is just a sequence of raw bytes?
It's a "unibyte string", which, by definition, contains raw bytes.

But it is actually better to say that the raw bytes there are \272 and
not \x3FFFBA. The latter is just the representation Emacs uses for
the former; Emacs goes out of its way not to show that internal
representation to the user.

> >> ... definition of the term character according to which a character
> >> actually _is_ that integer (cf. lisp manual, section "2.3.3 Character
> >> Type").
> >
> > It is an integer, but note that no one told you anywhere that a raw
> > byte is a character. It's a raw byte.
>
> Ah, that seems to be the key: raw bytes are not characters.

Exactly.

> (Up to now I thought that raw bytes are a special set of characters
> that have different representations in unibyte and multibyte
> contexts.)

They _are_ a special "character set", but only in the very technical
sense of "character set" in Emacs. By their nature and their
properties in Emacs, they are not characters.

> In spite of my previous promise not to try to learn something about the
> unibyte/multibyte topic from ASCII, I shyly dare to ask another question in
> this context (don't beat me): Does the A in the unibyte string "A" represent
> a character or a raw byte? Or both? In the latter case, is this the special
> treatment of ASCII you talked about before?

Raw bytes are only those whose value is above 127, so A is a
character.

For subtle technical reasons (or maybe by some historical accident), a
pure-ASCII string is a unibyte string, although it contains
characters, not raw bytes. So having a unibyte string does not yet
mean you have raw bytes in it.

> > I'd still suggest that you try as much as you can not to use unibyte
> > strings in your Lisp applications. That way lies madness.
>
> I will try to follow that advice--and I hope that it is not too late...
By far the only valid use case where you need to manipulate unibyte
strings of raw bytes is if you need to encode or decode strings by
calling encode-coding-region and its ilk. E.g., an application that
needs to send base64-encoded text needs first to encode it using
whatever coding-system is appropriate, which produces unibyte text
containing raw bytes, and then call base64-encode-region to produce
the final result. And similarly for decoding such stuff. You will
see examples of this in Gnus and Rmail, for example.

> So, thank you very much for your enlightening answers.

You are welcome.
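The encode-then-base64 pipeline described above can be sketched on strings rather than regions. This is a minimal illustration, not code taken from Gnus or Rmail; the two helper names are hypothetical, while encode-coding-string, decode-coding-string, base64-encode-string and base64-decode-string are standard functions.

```elisp
;; Hypothetical helpers illustrating the two-step pipeline.
(defun my-base64-of-text (text coding-system)
  "Encode TEXT with CODING-SYSTEM into bytes, then base64-encode the bytes."
  (base64-encode-string (encode-coding-string text coding-system) t))

(defun my-text-of-base64 (b64 coding-system)
  "Base64-decode B64 into bytes, then decode the bytes with CODING-SYSTEM."
  (decode-coding-string (base64-decode-string b64) coding-system))

(my-base64-of-text "\u00BA" 'utf-8)        ; => "wro="  (bytes \302 \272)
(my-base64-of-text "\u00BA" 'iso-8859-1)   ; => "ug=="  (byte \272)
(my-text-of-base64 "wro=" 'utf-8)          ; => "º"
```

The intermediate value returned by encode-coding-string is exactly the kind of unibyte string of raw bytes the message refers to; the same character yields different bytes, and hence different base64 text, under different coding systems.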
* RE: [Solved] RE: Differences between identical strings in Emacs lisp
From: Jürgen Hartmann @ 2015-04-08 11:01 UTC
To: help-gnu-emacs@gnu.org

Thank you, Eli Zaretskii, for your explanations:

>> [About mapping between unibyte and multibyte strings]
>>
>> First I thought that some hidden decoding based on some charsets or
>> coding systems occurs.
>
> Actually, some sort of "decoding" does occur, albeit perhaps not in
> the use cases you tried -- Emacs will sometimes silently convert
> unibyte characters to their locale-dependent multibyte equivalents.

On which occasions is such a conversion done? Has this anything to do with
the charset that is individually defined in language-info-alist for nearly
each language environment?

> This whole area of unibyte strings is replete with dwim-ish hacks and
> kludges, all in an attempt to do what the user expects. Thus the
> confusion and the advice to stay away from that gray area.

Sounds like the well known design conflict between "behaving smart" and
"being straight".

>> [About "\x3FFFBA\x3FFFBB\x3FFFBC"]
>
> It's a "unibyte string", which, by definition, contains raw bytes.
>
> But it is actually better to say that the raw bytes there are \272 and
> not \x3FFFBA. The latter is just the representation Emacs uses for
> the former; Emacs goes out of its way not to show that internal
> representation to the user.
>
>> ...
>>
>> Ah, that seems to be the key: raw bytes are not characters.
>
> Exactly.

Great! Lesson learned.

>> [About raw bytes]
>
> They _are_ a special "character set", but only in the very technical
> sense of "character set" in Emacs. By their nature and their
> properties in Emacs, they are not characters.
>
>> [About characters and raw bytes in unibyte context]
>
> Raw bytes are only those whose value is above 127, so A is a
> character.
>
> For subtle technical reasons (or maybe by some historical accident), a
> pure-ASCII string is a unibyte string, although it contains
> characters, not raw bytes. So having a unibyte string does not yet
> mean you have raw bytes in it.

It seems that all my related observations that puzzled me before can be well
explained by the strict distinction between characters and raw bytes and the
mapping between the latter's integer representations in the range
[0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
multibyte context.

> By far the only valid use case where you need to manipulate unibyte
> strings of raw bytes is if you need to encode or decode strings by
> calling encode-coding-region and its ilk. E.g., an application that
> needs to send base64-encoded text needs first to encode it using
> whatever coding-system is appropriate, which produces unibyte text
> containing raw bytes, and then call base64-encode-region to produce
> the final result. And similarly for decoding such stuff. You will
> see examples of this in Gnus and Rmail, for example.
>
>> So, thank you very much for your enlightening answers.
>
> You are welcome.

Thank you very much.

Juergen
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Eli Zaretskii @ 2015-04-08 11:59 UTC
To: help-gnu-emacs

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Wed, 8 Apr 2015 13:01:16 +0200
>
> >> [About mapping between unibyte and multibyte strings]
> >>
> >> First I thought that some hidden decoding based on some charsets or
> >> coding systems occurs.
> >
> > Actually, some sort of "decoding" does occur, albeit perhaps not in
> > the use cases you tried -- Emacs will sometimes silently convert
> > unibyte characters to their locale-dependent multibyte equivalents.
>
> On which occasions is such a conversion done?

One example that comes to mind is (insert 160), i.e. when inserting
text into a buffer. There are other examples, but I simply don't
remember them at the moment.

> Has this anything to do with the charset that is individually
> defined in language-info-alist for nearly each language environment?

No, I think Emacs converts the value to the character that has the
same Unicode codepoint.

> It seems that all my related observations that puzzled me before can be well
> explained by the strict distinction between characters and raw bytes and the
> mapping between the latter's integer representations in the range
> [0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
> multibyte context.

Pretty much, yes.
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Stefan Monnier @ 2015-04-08 12:37 UTC
To: help-gnu-emacs

>> > the use cases you tried -- Emacs will sometimes silently convert
>> > unibyte characters to their locale-dependent multibyte equivalents.

Nowadays this should happen extremely rarely, or never.

>> On which occasions is such a conversion done?
> One example that comes to mind is (insert 160), i.e. when inserting
> text into a buffer.

This doesn't do any conversion (although it did, in Emacs<23).
160 is simply taken as the code of the corresponding character in
Emacs's character space (which is basically Unicode), hence regardless
of locale.

If this `insert' is performed inside a unibyte buffer, then this 160 is
instead taken to be the code of a byte. Again, regardless of the locale.

AFAIR, the only "dwimish" conversion that still takes place on occasion
is between things like #x3FFFBA and #xBA (i.e. between a byte and
a character representing that same byte).

>> It seems that all my related observations that puzzled me before can be well
>> explained by the strict distinction between characters and raw bytes and the
>> mapping between the latter's integer representations in the range
>> [0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
>> multibyte context.
> Pretty much, yes.

Yes, distinguishing bytes (and byte strings/buffers) from chars (and
char strings/buffers) is key. Sadly, Emacs doesn't make it easy because
the terms used evolved from a time where byte=char and where people were
focused too much on the underlying/internal representation (hence the
terms "multibyte" vs "unibyte"), plus the fact that too much code relied
on byte=char to be able to make a clean design.
So when Emacs-20 appeared, it included all kinds of dwimish (and
locale-dependent) conversions to try and accommodate incorrect byte=char
assumptions. Over time, the design has been significantly cleaned up,
but the terminology is still problematic.

        Stefan
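The two readings of (insert 160) can be compared side by side in temporary buffers. This is a minimal sketch, assuming Emacs 23 or later; the helper name my-insert-160 is hypothetical.

```elisp
;; Hypothetical helper: insert 160 into a temporary buffer that is either
;; multibyte or unibyte, and describe what ends up in the buffer.
(defun my-insert-160 (multibyte)
  "Insert 160 into a temp buffer; MULTIBYTE selects the buffer's mode.
Return (CODE BYTES MULTIBYTE-P) for the resulting buffer contents."
  (with-temp-buffer
    (set-buffer-multibyte multibyte)
    (insert 160)
    (let ((s (buffer-string)))
      (list (aref s 0) (string-bytes s) (multibyte-string-p s)))))

(my-insert-160 t)     ; => (160 2 t)    character U+00A0, two bytes internally
(my-insert-160 nil)   ; => (160 1 nil)  a single byte with value 160
```

In both cases the code reads back as 160 and no locale is involved; only the meaning of that 160 (character versus byte) differs with the buffer's multibyte setting.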
* RE: [Solved] RE: Differences between identical strings in Emacs lisp
From: Jürgen Hartmann @ 2015-04-09 10:38 UTC
To: help-gnu-emacs@gnu.org

Thank you for the clarification, Stefan Monnier:

>>>> the use cases you tried -- Emacs will sometimes silently convert
>>>> unibyte characters to their locale-dependent multibyte equivalents.
>
> Nowadays this should happen extremely rarely, or never.
>
>>> On which occasions is such a conversion done?
>> One example that comes to mind is (insert 160), i.e. when inserting
>> text into a buffer.
>
> This doesn't do any conversion (although it did, in Emacs<23).
> 160 is simply taken as the code of the corresponding character in
> Emacs's character space (which is basically Unicode), hence regardless
> of locale.
>
> If this `insert' is performed inside a unibyte buffer, then this 160 is
> instead taken to be the code of a byte. Again, regardless of the locale.

So this is comparable to the output of \xA0 in a unibyte string (e.g. in
"\xA0\ A") in contrast to the same in a multibyte string (e.g. in "\xA0 Ä"):
The former yields the raw byte \240, the latter a no-break space.

> AFAIR, the only "dwimish" conversion that still takes place on occasion
> is between things like #x3FFFBA and #xBA (i.e. between a byte and
> a character representing that same byte).

(*Broad grin*) I think I'll appoint this one my favorite trap. (See my
previous post.)

>>> It seems that all my related observations that puzzled me before can be well
>>> explained by the strict distinction between characters and raw bytes and the
>>> mapping between the latter's integer representations in the range
>>> [0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
>>> multibyte context.
>> Pretty much, yes.
>
> Yes, distinguishing bytes (and byte strings/buffers) from chars (and
> char strings/buffers) is key. Sadly, Emacs doesn't make it easy because
> the terms used evolved from a time where byte=char and where people were
> focused too much on the underlying/internal representation (hence the
> terms "multibyte" vs "unibyte"), plus the fact that too much code relied
> on byte=char to be able to make a clean design. So when Emacs-20
> appeared, it included all kinds of dwimish (and locale-dependent)
> conversions to try and accommodate incorrect byte=char assumptions.
> Over time, the design has been significantly cleaned up, but the
> terminology is still problematic.

I could imagine that the step from the equivalence char=byte to
char=unicode code point (long(er) integer) is not so difficult. But we have
in addition the UTF-8 representation. To which of the latter two--unicode
code point (integer, several bytes long) or its UTF-8 representation
(sequence of several bytes)--does the term "multibyte" refer?

Thank you for the insight into the historic background.

Juergen
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Stefan Monnier @ 2015-04-09 12:32 UTC
To: help-gnu-emacs

> I could imagine that the step from the equivalence char=byte to
> char=unicode code point (long(er) integer) is not so difficult. But we have
> in addition the UTF-8 representation. To which of the latter two--unicode
> code point (integer, several bytes long) or its UTF-8 representation
> (sequence of several bytes)--does the term "multibyte" refer?

multibyte refers to "string of characters". These have been represented
internally using an iso-2022 encoding until Emacs-22 and since Emacs-23
they're represented internally with a utf-8 encoding. The name comes
from the fact that each element can use up more than one byte.
But that's just an internal detail that is mostly hidden from Elisp.

To turn such a string of characters into a string of bytes you need to
use things like encode-coding-(string|buffer), at which point you have
to specify which encoding you want to use (e.g. utf-8).

        Stefan
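The step from a string of characters to a string of bytes can be made concrete with encode-coding-string. This is a minimal sketch; the helper name my-shape is hypothetical and only reports what kind of string it was given.

```elisp
;; Hypothetical helper: summarize a string as (MULTIBYTE-P LENGTH CODES).
(defun my-shape (s)
  "Return (MULTIBYTE-P LENGTH CODES) for string S."
  (list (multibyte-string-p s) (length s) (append s nil)))

(my-shape "\u00BA")                                    ; => (t 1 (186))
(my-shape (encode-coding-string "\u00BA" 'utf-8))      ; => (nil 2 (194 186))
(my-shape (encode-coding-string "\u00BA" 'iso-8859-1)) ; => (nil 1 (186))
```

The first string holds one character; the other two hold the bytes that represent it in a chosen external encoding, which is why the encoding has to be named explicitly.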
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Eli Zaretskii @ 2015-04-09 12:45 UTC
To: help-gnu-emacs

> From: Jürgen Hartmann <juergen_hartmann_@hotmail.com>
> Date: Thu, 9 Apr 2015 12:38:43 +0200
>
> > If this `insert' is performed inside a unibyte buffer, then this 160 is
> > instead taken to be the code of a byte. Again, regardless of the locale.
>
> So this is comparable to the output of \xA0 in a unibyte string
> (e.g. in "\xA0\ A") in contrast to the same in a multibyte string (e.g. in
> "\xA0 Ä"): The former yields the raw byte \240, the latter a no-break space.

Yes, Emacs tries to treat buffers and strings alike.

> I could imagine that the step from the equivalence char=byte to
> char=unicode code point (long(er) integer) is not so difficult.

The problem with this is that an encoded character could span several
bytes, and then what do you call each byte of such a multibyte
sequence? You cannot call it a character.

> But we have in addition the UTF-8 representation.

If you mean the internal representation, then it's a superset of
UTF-8, not UTF-8. If you mean the external encoding of text, then
UTF-8 is not the only representation, not even the only multibyte
representation. There are others, mostly used in the Far East, but not
only there. Even UTF-16, used natively by MS-Windows, is technically
a multibyte representation.

> To which of the latter two--unicode code point (integer, several
> bytes long) or its UTF-8 representation (sequence of several bytes)--
> does the term "multibyte" refer?

In the context of Emacs, it refers to the internal representation of
characters, which is a superset of UTF-8.
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Richard Wordingham @ 2015-04-10 2:35 UTC
To: help-gnu-emacs

On Thu, 09 Apr 2015 15:45:06 +0300
Eli Zaretskii <eliz@gnu.org> wrote:

> Even UTF-16, used natively by MS-Windows, is technically
> a multibyte representation.

UTF-16 is a multiword representation (Cuneiform takes 4 bytes per
character), so there's no 'technically' about it.

Richard.
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Stefan Monnier @ 2015-04-10 4:46 UTC
To: help-gnu-emacs

>> Even UTF-16, used natively by MS-Windows, is technically
>> a multibyte representation.
> UTF-16 is a multiword representation (Cuneiform takes 4 bytes per
> character), so there's no 'technically' about it.

Even the simplest chars like `a' are represented as "multiple bytes", so
yes, it's definitely a multibyte representation.

        Stefan
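Both byte counts mentioned in this exchange can be checked with a short sketch. It assumes the `utf-16le` coding system (little endian, no byte order mark); the helper name `demo-utf16-length` is invented for the example.

```elisp
(defun demo-utf16-length (char)
  "Return the number of bytes CHAR occupies when encoded as UTF-16LE.
Hypothetical helper; utf-16le is used so that no BOM is counted."
  (length (encode-coding-string (string char) 'utf-16le)))

(demo-utf16-length ?a)      ; => 2  even plain ASCII takes two bytes
(demo-utf16-length #x12000) ; => 4  a Cuneiform sign needs a surrogate pair
```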
* RE: [Solved] RE: Differences between identical strings in Emacs lisp
From: Jürgen Hartmann @ 2015-04-10 12:24 UTC
To: help-gnu-emacs@gnu.org

Thank you all for the profound explanations and comments, and the very
informative glances under the hood.

Juergen
* RE: [Solved] RE: Differences between identical strings in Emacs lisp
From: Jürgen Hartmann @ 2015-04-09 10:36 UTC
To: help-gnu-emacs@gnu.org

Thank you, Eli Zaretskii, for your answer:

>> [About mapping between unibyte and multibyte strings]
>>
>> On which occasion is such a conversion done?
>
> One example that comes to mind is (insert 160), i.e. when inserting
> text into a buffer. There are other examples, but I simply don't
> remember them at the moment.
>
>> Has this anything to do with the charset that is individually
>> defined in language-info-alist for nearly each language environment?
>
> No, I think Emacs converts the value to the character that has the
> same Unicode codepoint.

I see.

>> It seems that all my related observations that puzzled me before can be well
>> explained by the strict distinction between characters and raw bytes and the
>> mapping between the latter's integer representations in the range
>> [0x80..0xFF] in a unibyte context and in the range [0x3FFF80..0x3FFFFF] in a
>> multibyte context.
>
> Pretty much, yes.

Thank you for the confirmation.

Juergen
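The mapping between the two integer ranges for raw bytes can be observed with a sketch like the one below. It relies on the built-in `unibyte-string`, `string-to-multibyte` and `multibyte-char-to-unibyte`; the function name `demo-raw-byte-codes` is an assumption made for the illustration.

```elisp
(defun demo-raw-byte-codes (byte)
  "Return (UNIBYTE-CODE MULTIBYTE-CODE ROUND-TRIP) for the raw BYTE.
BYTE should be in the range 128..255.  Hypothetical helper."
  (let* ((uni   (unibyte-string byte))       ; unibyte string with one raw byte
         (multi (string-to-multibyte uni)))  ; same raw byte in multibyte form
    (list (aref uni 0)                       ; code in the unibyte context
          (aref multi 0)                     ; code in the multibyte context
          ;; Converting the multibyte code back gives the original byte.
          (multibyte-char-to-unibyte (aref multi 0)))))

(demo-raw-byte-codes #xA0) ; => (160 4194208 160), and 4194208 is #x3FFFA0
```

The offset between the two codes is #x3FFF00, which maps [0x80..0xFF] onto [0x3FFF80..0x3FFFFF].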
* Re: [Solved] RE: Differences between identical strings in Emacs lisp
From: Thien-Thi Nguyen @ 2015-04-07 18:24 UTC
To: help-gnu-emacs@gnu.org

() Jürgen Hartmann <juergen_hartmann_@hotmail.com>
() Tue, 7 Apr 2015 19:02:38 +0200

   But now [...] I know the enemy, or at least its name.

If you are referring to unibyte strings, i'd fancy a less
adversarial stance, personally. Dealing w/ unibyte is like
asking a young child for directions in a strange town. If
you understand the answers you receive, you may choose to be
charmed. Otherwise, only lost and frustrated (you can still
choose to be charmed by your frustrations, i suppose -- just
look at all the DENIGRATED-TEXT-EDITOR users out there :-D).

-- 
Thien-Thi Nguyen -----------------------------------------------
 (if you're human and you know it) read my lisp:
  (defun responsep (type via)
    (case type
      (technical (eq 'mailing-list via))
      ...))
---------------------------------------------- GPG key: 4C807502
* RE: [Solved] RE: Differences between identical strings in Emacs lisp
From: Jürgen Hartmann @ 2015-04-09 10:40 UTC
To: help-gnu-emacs@gnu.org

Thank you, Thien-Thi Nguyen, for your soothing words:

>> But now [...] I know the enemy, or at least its name.
>
> If you are referring to unibyte strings, i'd fancy a less
> adversarial stance, personally.

Right you are: One should always believe in the true goodness and virtue of
all kinds of strings and think positive. :-)

> Dealing w/ unibyte is like
> asking a young child for directions in a strange town. If
> you understand the answers you receive, you may choose to be
> charmed.

So you mean that one might attain mild happiness and innocent luck from the
interaction with unibyte objects? I am definitely looking forward to it.

> Otherwise, only lost and frustrated (you can still
> choose to be charmed by your frustrations, i suppose -- just
> look at all the DENIGRATED-TEXT-EDITOR users out there :-D).

But also those benefit from THE-ONE-EDITOR: It establishes the contrast.

But seriously: From the profound advice I received from all of you I learned
that there are some traps related to this topic that have to be avoided or
handled with appropriate caution. So thank you all very much again.

Juergen