Design decision of string in Emacs

unofficial mirror of emacs-devel@gnu.org 

* Design decision of string in Emacs
@ 2020-12-16 13:12 Zhu Zihao
  2020-12-16 14:56 ` Stefan Monnier
  2020-12-16 16:16 ` Eli Zaretskii
  0 siblings, 2 replies; 3+ messages in thread
From: Zhu Zihao @ 2020-12-16 13:12 UTC (permalink / raw)
  To: emacs-devel


Recently I was browsing the Emacs China forum and saw a strange question [1]:

```
(string-bytes (concat (symbol-name 'GET) (encode-coding-string "我" 'utf-8)))
;; => 9

(string-bytes (concat (symbol-name 'GET) (encode-coding-string "foo" 'utf-8)))
;; => 6

(string-bytes (concat "GET" (encode-coding-string "我" 'utf-8)))
;; => 6
```

When a string returned by `symbol-name` is concatenated with encoded CJK
characters, the result has more bytes than expected.

Curiosity drove me to look into this. I read a good deal of the manual
and source code (mule-conf.el, lread.c) and ran some experiments of my own.

My conclusions are:

1. When a unibyte string is concatenated with a multibyte string, Emacs
converts each non-ASCII byte to an eight-bit char in #x3FFF80..#x3FFFFF.

2. symbol-name returns a multibyte string, because a symbol name is
always meant to be a "multibyte string" rather than bytes; so even when
the name contains only ASCII characters, Emacs marks it as multibyte.

3. The reader first assumes that a string literal is unibyte and marks
it as multibyte only if it encounters a multibyte char. That's why
(string-bytes (concat "GET" (encode-coding-string "我" 'utf-8)))
returns 6: Emacs treats it as a concatenation of two unibyte strings.
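
The three conclusions above can be checked with a short sketch. It
assumes a reasonably recent Emacs; the `demo-` names are hypothetical,
"\u6211" is an escape for 我 (so the file encoding does not matter), and
`string-to-multibyte` stands in for `symbol-name`, since whether a
symbol name is multibyte may depend on how the symbol was read:

```elisp
(defun demo-utf8-bytes ()
  "Return the UTF-8 encoding of 我 as a unibyte string (3 bytes)."
  (encode-coding-string "\u6211" 'utf-8))

(defun demo-mixed-concat ()
  "Concatenate a multibyte ASCII string with unibyte UTF-8 bytes."
  (concat (string-to-multibyte "GET") (demo-utf8-bytes)))

;; Conclusion 3: an ASCII-only literal is unibyte, so this concat
;; joins two unibyte strings and stays unibyte.
(multibyte-string-p "GET")                        ; => nil
(string-bytes (concat "GET" (demo-utf8-bytes)))   ; => 6

;; Conclusion 1: once one argument is multibyte, each non-ASCII byte
;; becomes an eight-bit char, which takes 2 bytes in the internal
;; representation: 3 + 3*2 = 9.
(multibyte-string-p (demo-mixed-concat))          ; => t
(length (demo-mixed-concat))                      ; => 6
(string-bytes (demo-mixed-concat))                ; => 9
(format "#x%X" (aref (demo-mixed-concat) 3))      ; => "#x3FFFE6"
```

The last form shows byte #xE6 mapped into the #x3FFF80..#x3FFFFF range.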

IMO, a multibyte string in Emacs is like a "string", while a unibyte
string is like a vector of u8 numbers.

In some languages, bytes and strings are different types that can't be
concatenated without conversion, and an attempt to convert invalid bytes
to a string throws an error. Emacs instead extends the Unicode charset
to tolerate such malformed bytes.

I'm interested in the following points.

1. Why does Emacs use the same type to represent both bytes and strings?
Putting them in different types (if we had a time machine) might be much
clearer and avoid some confusion.

2. Why does Emacs extend the Unicode charset to hold single eight-bit
bytes? I don't know of any practical use.

3. Is there an existing best practice for manipulating strings and
bytes? If there is none, we could discuss it and record it in the Elisp
manual.

[1]: https://emacs-china.org/t/concat-symbol-name-get-encode-coding-string-utf-8-bytes/15350


Zihao


* Re: Design decision of string in Emacs
From: Stefan Monnier @ 2020-12-16 14:56 UTC (permalink / raw)
  To: Zhu Zihao; +Cc: emacs-devel

> ```
> (string-bytes (concat (symbol-name 'GET) (encode-coding-string "我" 'utf-8)))
> ;; => 9
>
> (string-bytes (concat (symbol-name 'GET) (encode-coding-string "foo" 'utf-8)))
> ;; => 6
>
> (string-bytes (concat "GET" (encode-coding-string "我" 'utf-8)))
> ;; => 6
> ```

Oh, you're looking at the ugly mess we still have under the carpet, huh?

[ Based on the rest of what you wrote I gather that you did figure out
  what's going on: congratulations!  ]

> 1. Why does Emacs use the same type to represent both bytes and strings?
> Putting them in different types (if we had a time machine) might be much
> clearer and avoid some confusion.

Emacs started with 8-bit characters, so there was no good reason to
distinguish sequences of bytes from sequences of characters.
When support for larger character sets was introduced (in MULE), the
need to work with existing ELisp code made it necessary to be very
permissive w.r.t. confusion between chars and bytes.

This led to introducing 2 types (unibyte and multibyte strings) but
pretending as hard as possible that it's still just a single type.
Also when MULE was merged into the official version of Emacs, the
original focus was on trying to avoid regressions, so it was important
to automatically treat bytes as "iso-8859-1 chars" and vice versa, like
the old Emacs used to do.

Over time, we have made the distinction a bit stronger, introducing
a few more checks and signaling a few more errors, but we're still very
much in the "DWIM" world.  A big reason for that is that there's no
distinction (in the printed representation) between unibyte and
multibyte for strings which only contain ASCII.

In my local/personal Emacs branch, I tried to improve this (to try and
avoid the kind of inconsistency you show in your example above, for
example) by treating "ASCII strings" specially, considering them to be
both unibyte and multibyte at the same time.  It kinda works, but it's
not clear it's a sufficient improvement to justify the (minor) backward
incompatibility it introduces.

> 2. Why does Emacs extend the Unicode charset to hold single eight-bit
> bytes? I don't know of any practical use.

Ah, that question is much simpler: when reading a file labeled as using
utf-8 bytes, we need to handle the case where the content is actually
not valid utf-8.  We could just signal an error and refuse to read the
file, but we decided instead to make it possible to read and edit such
files by representing (in the buffer) the invalid byte sequences using
those special "eight-bit byte characters".  This way, you can edit
a "mostly utf-8 file with some invalid byte sequences" just fine and
those invalid byte sequences will be properly preserved when you save
the file.

Of course, this is also used for encodings other than utf-8.
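
The same round trip can be seen on strings instead of files. A minimal
sketch, assuming a recent Emacs; `demo-roundtrip` and `demo-bad-bytes`
are hypothetical names, and #xFF is used because it can never appear in
valid UTF-8:

```elisp
(defun demo-roundtrip (bytes)
  "Decode unibyte BYTES as UTF-8, then encode the result back to UTF-8."
  (encode-coding-string (decode-coding-string bytes 'utf-8) 'utf-8))

;; "a", an invalid byte, "b".
(defvar demo-bad-bytes (unibyte-string ?a #xff ?b))

;; Decoding does not signal an error; the invalid byte is kept as an
;; eight-bit char inside a multibyte string.
(let ((decoded (decode-coding-string demo-bad-bytes 'utf-8)))
  (list (multibyte-string-p decoded)
        (format "#x%X" (aref decoded 1))))
;; => (t "#x3FFFFF")

;; Encoding turns the eight-bit char back into the original byte.
(equal (demo-roundtrip demo-bad-bytes) demo-bad-bytes)  ; => t
```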

> 3. Is there an existing best practice for manipulating strings and
> bytes? If there is none, we could discuss it and record it in the
> Elisp manual.

Not really.  My own (very general) recommendation is to try and remember
that unibyte strings are sequences of bytes while multibyte strings are
sequences of characters and to try and keep it very clear in your head
when you're manipulating bytes and when you're manipulating characters.
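
One way to act on that recommendation is to convert everything to bytes
at a single, explicit point before concatenating. This is only a sketch
under assumptions: `demo-as-bytes` and `demo-bytes-concat` are
hypothetical helpers, not existing Emacs functions, and UTF-8 is assumed
as the wire encoding:

```elisp
(defun demo-as-bytes (string)
  "Return STRING as unibyte bytes, UTF-8 encoding it only if multibyte."
  (if (multibyte-string-p string)
      (encode-coding-string string 'utf-8)
    string))

(defun demo-bytes-concat (&rest strings)
  "Concatenate STRINGS after converting each one to bytes."
  (apply #'concat (mapcar #'demo-as-bytes strings)))

;; A multibyte "GET" plus already-encoded bytes now gives 6 bytes, not 9.
(string-bytes (demo-bytes-concat (string-to-multibyte "GET")
                                 (encode-coding-string "\u6211" 'utf-8)))
;; => 6
```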

        Stefan


* Re: Design decision of string in Emacs
From: Eli Zaretskii @ 2020-12-16 16:16 UTC (permalink / raw)
  To: Zhu Zihao; +Cc: emacs-devel

> From: Zhu Zihao <all_but_last@163.com>
> Date: Wed, 16 Dec 2020 21:12:41 +0800
> 
> (string-bytes (concat (symbol-name 'GET) (encode-coding-string "我" 'utf-8)))
> ;; => 9

To avoid confusion like this, always encode _last_:

 (string-bytes (encode-coding-string (concat (symbol-name 'GET) "我") 'utf-8))
   => 6
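
Wrapped as a function, the "encode last" rule might look like the sketch
below. `demo-encode-last` is a hypothetical name, and "\u6211" is an
escape for 我:

```elisp
(defun demo-encode-last (method text)
  "Concatenate METHOD and TEXT as characters, then encode once as UTF-8."
  (encode-coding-string (concat method text) 'utf-8))

;; The concat happens on characters, so no eight-bit chars are created;
;; the single encoding step then yields 3 + 3 = 6 bytes.
(string-bytes (demo-encode-last (symbol-name 'GET) "\u6211"))        ; => 6
(multibyte-string-p (demo-encode-last (symbol-name 'GET) "\u6211"))  ; => nil
```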


