unofficial mirror of emacs-devel@gnu.org 
* Design decision of string in Emacs
@ 2020-12-16 13:12 Zhu Zihao
  2020-12-16 14:56 ` Stefan Monnier
  2020-12-16 16:16 ` Eli Zaretskii
  0 siblings, 2 replies; 3+ messages in thread
From: Zhu Zihao @ 2020-12-16 13:12 UTC
  To: emacs-devel


Recently I was browsing the Emacs China forum and came across a puzzling question [1]:

```
(string-bytes (concat (symbol-name 'GET) (encode-coding-string "我" 'utf-8)))
;; => 9

(string-bytes (concat (symbol-name 'GET) (encode-coding-string "foo" 'utf-8)))
;; => 6

(string-bytes (concat "GET" (encode-coding-string "我" 'utf-8)))
;; => 6
```

When the string returned by `symbol-name` is concatenated with encoded CJK
characters, the result contains more bytes than expected.

Curiosity drove me to do some research on this. I read a lot of the
manual and the source code (mule-conf.el, lread.c) and ran some experiments of my own.

My conclusion is:

1. When concatenating a unibyte string with a multibyte string, Emacs
converts each non-ASCII byte to an eight-bit character in the range
#x3FFF80..#x3FFFFF.

2. symbol-name returns a multibyte string, because a symbol name should
always be text rather than bytes. So even when the symbol name contains
only ASCII characters, Emacs marks it as a multibyte string.

3. A string constructed by the reader is first assumed to be unibyte;
if the reader encounters any multibyte character, it marks the string as
multibyte. That is why (string-bytes (concat "GET"
(encode-coding-string "我" 'utf-8))) returns 6: Emacs treats this as a
concatenation of two unibyte strings.
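
The points above can be checked with a small sketch like the one below. The
helper name `demo-concat-info` is made up for illustration, and it uses
`string-to-multibyte` instead of `symbol-name` so that the multibyteness of
the prefix is explicit (whether `symbol-name` returns a multibyte string may
depend on the Emacs version).

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-concat-info (prefix)
  "Concatenate PREFIX with the UTF-8 bytes of U+6211 and describe the result.
Return a list (MULTIBYTE-P LENGTH STRING-BYTES LAST-CHAR)."
  (let* ((bytes (encode-coding-string (string #x6211) 'utf-8)) ; unibyte: E6 88 91
         (s (concat prefix bytes)))
    (list (multibyte-string-p s)
          (length s)
          (string-bytes s)
          (aref s (1- (length s))))))

;; Unibyte + unibyte: the bytes are kept as they are.
(demo-concat-info "GET")
;; => (nil 6 6 145)            ; 145 = #x91

;; Multibyte + unibyte: each non-ASCII byte becomes an eight-bit
;; character, which takes 2 bytes in the internal representation.
(demo-concat-info (string-to-multibyte "GET"))
;; => (t 6 9 4194193)          ; 4194193 = #x3FFF91
```

The character count is 6 in both cases; only the byte count differs, which
matches the 6 versus 9 seen in the original question.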

IMO, a multibyte string in Emacs is like a "string", while a unibyte string
is like a vector of u8 numbers.

In some languages, bytes and strings are different types that cannot be
concatenated without conversion, and attempting to convert invalid bytes to
a string throws an error. Emacs instead extends the Unicode charset to
tolerate these malformed bytes.
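
As a rough illustration of this tolerance, the sketch below decodes a byte
sequence that is not valid UTF-8 (#xFF can never appear in UTF-8). The helper
name `demo-decode-invalid` is made up for this example. No error is
signalled; the bad byte is kept as an eight-bit character and comes back
unchanged when the string is encoded again.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-decode-invalid ()
  "Decode invalid UTF-8 and return (MULTIBYTE-P MIDDLE-CHAR ROUND-TRIP-P)."
  (let* ((raw (unibyte-string ?a #xFF ?b))            ; #xFF is invalid in UTF-8
         (decoded (decode-coding-string raw 'utf-8))  ; no error is signalled
         (again (encode-coding-string decoded 'utf-8)))
    (list (multibyte-string-p decoded)
          (aref decoded 1)        ; the eight-bit character for byte #xFF
          (equal again raw))))    ; the original bytes survive the round trip

(demo-decode-invalid)
;; => (t 4194303 t)               ; 4194303 = #x3FFFFF
```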

I'm interested in the following points.

1. Why does Emacs use the same type to represent both bytes and strings?
Putting them in different types (if we had a time machine) might be much
clearer and avoid some confusion.

2. Why does Emacs extend the Unicode charset to hold single eight-bit
bytes? I don't know if there's any practical use.

3. Is there any existing best practice for manipulating strings and bytes?
If there's none, we could discuss it and record it in the Elisp manual.


[1]: https://emacs-china.org/t/concat-symbol-name-get-encode-coding-string-utf-8-bytes/15350

-- 
Retrieve my PGP public key:

  gpg --recv-keys D47A9C8B2AE3905B563D9135BE42B352A9F6821F

Zihao

