From: Zhu Zihao
Newsgroups: gmane.emacs.devel
Subject: Design decision of string in Emacs
Date: Wed, 16 Dec 2020 21:12:41 +0800
Message-ID: <86h7olyjw7.fsf@163.com>
User-Agent: mu4e 1.4.13; emacs 27.1
To: emacs-devel@gnu.org

Recently I was browsing the Emacs China forum and saw a weird question[1]

```
(string-bytes (concat (symbol-name 'GET) (encode-coding-string "我" 'utf-8)))
;; => 9
(string-bytes (concat (symbol-name 'GET) (encode-coding-string "foo" 'utf-8)))
;; => 6
(string-bytes (concat "GET" (encode-coding-string "我" 'utf-8)))
;; => 6
```

When concatenating the string returned by `symbol-name` with encoded CJK
characters, the result has more bytes than expected.

Curiosity drove me to do some research on this. After reading a lot of
the manual and source code (mule-conf.el, lread.c) and running some
experiments myself, my conclusion is:

1. When concatenating a unibyte string with a multibyte string, Emacs
converts the non-ASCII bytes to eight-bit chars in #x3FFF80..#x3FFFFF.

2. symbol-name returns a multibyte string, because a symbol name should
always be a "multibyte string" and not bytes. So even if the symbol name
contains only ASCII characters, Emacs marks it as a multibyte string.

3. A string constructed by the reader is first assumed to be unibyte; if
the reader encounters any multibyte char, it marks the string as
multibyte. That's why

(string-bytes (concat "GET" (encode-coding-string "我" 'utf-8)))

returns 6: Emacs considers this a concat of two unibyte strings.

IMO, a multibyte string in Emacs is like a "string", and a unibyte string
is like a vector of u8 numbers.

In some languages, bytes and strings are different types and can't be
concatenated without conversion, and an attempt to convert invalid bytes
to a string throws an error. But Emacs extends the Unicode charset to
tolerate these malformed bytes.

I'm interested in the following points.

1. Why does Emacs use the same type to represent both bytes and strings?
Putting them in different types (if we had a time machine) might be much
clearer and avoid some confusion.

2. Why does Emacs extend the Unicode charset to hold single eight-bit
bytes? I don't know if there's any practical use.

3. Is there any existing best practice for manipulating strings and
bytes? If there's none, we could discuss it and record it in the Elisp
manual.

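The conclusions above can be checked with a small sketch like the one
below. `zz-string-report` is a hypothetical helper name used only for
illustration; it is not part of Emacs. The sketch uses
`string-to-multibyte` to get a prefix that is certainly multibyte, so it
does not depend on what `symbol-name` returns in a given Emacs version.
The last call shows one possible way (an assumption, not an established
practice) to keep a byte string unibyte: encode the symbol name before
concatenating.

```elisp
;; -*- coding: utf-8 -*-
;; `zz-string-report' is a hypothetical helper, not part of Emacs.
(defun zz-string-report (prefix bytes)
  "Return (MULTIBYTE-P NBYTES CHARS) for the result of (concat PREFIX BYTES)."
  (let ((s (concat prefix bytes)))
    (list (multibyte-string-p s)
          (string-bytes s)
          ;; Each element is one character code, printed in hex.
          (mapcar (lambda (c) (format "#x%X" c)) (string-to-list s)))))

(let ((bytes (encode-coding-string "我" 'utf-8))) ; unibyte, bytes E6 88 91
  (list
   ;; Multibyte prefix + unibyte bytes: the bytes become eight-bit chars.
   (zz-string-report (string-to-multibyte "GET") bytes)
   ;; ASCII-only literal from the reader is unibyte: result stays unibyte.
   (zz-string-report "GET" bytes)
   ;; Encoding the symbol name first makes both operands unibyte.
   (zz-string-report (encode-coding-string (symbol-name 'GET) 'utf-8)
                     bytes)))
```

The first report should show a multibyte result of 9 bytes whose last
three chars are #x3FFFE6, #x3FFF88 and #x3FFF91; the other two should
show a unibyte result of 6 bytes.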
[1]: https://emacs-china.org/t/concat-symbol-name-get-encode-coding-string-utf-8-bytes/15350

-- 
Retrieve my PGP public key:

  gpg --recv-keys D47A9C8B2AE3905B563D9135BE42B352A9F6821F

Zihao