From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Is copy_string_contents in emacs-module.h give us a proper UTF-8 string?
Date: Thu, 08 Oct 2020 10:38:13 +0300
Message-ID: <83blhd88wq.fsf@gnu.org>
To: "Zhu Zihao"
Cc: emacs-devel@gnu.org

> Date: Thu, 8 Oct 2020 14:09:53 +0800 (CST)
> From: "Zhu Zihao"
>
> To support this multitude of characters and scripts, Emacs closely
> follows the “Unicode Standard”. The Unicode Standard assigns a unique
> number, called a “codepoint”, to each and every character. The range of
> codepoints defined by Unicode, or the Unicode “codespace”, is
> ‘0..#x10FFFF’ (in hexadecimal notation), inclusive. Emacs extends this
> range with codepoints in the range ‘#x110000..#x3FFFFF’, which it uses
> for representing characters that are not unified with Unicode and “raw
> 8-bit bytes” that cannot be interpreted as characters. Thus, a
> character codepoint in Emacs is a 22-bit integer.
>
> Will "copy_string_contents" always give us a proper UTF-8 string, or
> will it give us a mix of bytevector and UTF-8?

If the original string includes raw bytes, copy_string_contents will
signal an error.
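
To make the ranges in the quoted manual text concrete, here is a minimal C sketch that classifies a code point as lying in the Unicode codespace or in the Emacs-only extension range, and checks the "22-bit integer" claim. The helper names (in_unicode_codespace, in_emacs_extension, bits_needed) and the macros are hypothetical and are not part of emacs-module.h; only the numeric ranges come from the quoted text.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

/* Ranges as given in the quoted manual text. */
#define UNICODE_MAX    0x10FFFFu  /* last code point of the Unicode codespace */
#define EMACS_EXT_MIN  0x110000u  /* first Emacs-only code point */
#define EMACS_CHAR_MAX 0x3FFFFFu  /* last Emacs code point */

/* The largest Emacs code point is 2^22 - 1, hence "a 22-bit integer". */
static_assert(EMACS_CHAR_MAX == (1u << 22) - 1u,
              "Emacs code points fit in exactly 22 bits");

/* Hypothetical helper: true if C lies in the Unicode codespace. */
static bool in_unicode_codespace(uint32_t c)
{
  return c <= UNICODE_MAX;
}

/* Hypothetical helper: true if C lies in the range Emacs adds on top of
   Unicode (non-unified characters and raw 8-bit bytes). */
static bool in_emacs_extension(uint32_t c)
{
  return c >= EMACS_EXT_MIN && c <= EMACS_CHAR_MAX;
}

/* Hypothetical helper: number of bits needed to represent C. */
static int bits_needed(uint32_t c)
{
  int n = 0;
  while (c != 0)
    {
      n++;
      c >>= 1;
    }
  return n;
}
```

A code point for which in_emacs_extension returns true has no counterpart in Unicode, so a string containing one cannot be handed to a module as standard UTF-8. That is consistent with copy_string_contents signaling an error for strings that include raw bytes.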