From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Philipp Stephani Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] Allow inserting non-BMP characters Date: Tue, 26 Dec 2017 18:50:46 +0000 Message-ID: References: <20171225210115.13789-1-phst@google.com> <83d132hz9e.fsf@gnu.org> <834lodii55.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="94eb2c03457e63e1fa056142c39a" X-Trace: blaine.gmane.org 1514314195 7375 195.159.176.226 (26 Dec 2017 18:49:55 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Tue, 26 Dec 2017 18:49:55 +0000 (UTC) Cc: phst@google.com, emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Dec 26 19:49:50 2017 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eTuIO-0001XP-32 for ged-emacs-devel@m.gmane.org; Tue, 26 Dec 2017 19:49:48 +0100 Original-Received: from localhost ([::1]:52604 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eTuKM-0004oH-Pj for ged-emacs-devel@m.gmane.org; Tue, 26 Dec 2017 13:51:50 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:57518) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eTuJa-0004mK-NH for emacs-devel@gnu.org; Tue, 26 Dec 2017 13:51:04 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eTuJY-00038C-Ou for emacs-devel@gnu.org; Tue, 26 Dec 2017 13:51:02 -0500 Original-Received: from mail-qt0-x22f.google.com ([2607:f8b0:400d:c0d::22f]:45578) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1eTuJW-00035n-Dq; Tue, 26 Dec 2017 13:50:58 -0500 Original-Received: by mail-qt0-x22f.google.com with SMTP id g10so45704373qtj.12; Tue, 26 Dec 2017 10:50:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=rJ0srZN3tfAwMYIvXL7N19PdE8LKLXLGw5e14vBMO0o=; b=WRv5H205BREuP4bMBjo5jkiqkMi/xycqJ2Uhp8XAUQ/3205iDE0xdBQ/A7aDhSPVNj lRgd4OMVvFhYsa8HgPx2RWldRnZ+9AlER+0G8ro1QInscpsFgyabcWN+Mkt/dYGuD6vo QuGl/Os7in0bbk5E29Nr9bLdJ4BjbMtvqsQXI/gx6ZMhslxYJfclfFnXv41BLqytI5hQ JYjrtMLTqX95UmL5QTPdpN0Xktq8HbXhBjmxcHmbtqdN3aNFGK5W/T4n/SlZMRDTcYHj g+DLABJle83n67zeRcj7vMTXvjCO8psqhASzvcHvoEbMNue/04vEG3CxrMIY2RFDq9CS NKpQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=rJ0srZN3tfAwMYIvXL7N19PdE8LKLXLGw5e14vBMO0o=; b=AZSZiwrlFjG6PYwQH6+tMnFZAjfmJdNqIpMRqWX+3qknOsB5kRNm6+X4qAJ7Do5rBa MOXlhKfXjzDLZHpQJxOLXUsTzthGBmTVKKx/FZNvpHHz5Kz0VQdjGJHNQApefGoIEK2c iMyHmBsHwesEYAeWw1kNVGXDM6qfRH46z55zrrbcFgoNOIvi0ZmVunzQTkmtMK3jpSdg UTleqXyyJaFRSYhHQlmu59O06uc/QM4E6jbr2S/nyPUEqxxNhHq9IyoVmjv0ilQ0wZSW HookR8TE9iMUOGcP0HrqfX3fHwd/peS0gvWjPjTF80YhdmBRvwR+23s3IA3M3OiY+sBJ 8Ikg== X-Gm-Message-State: AKGB3mJYDaXqHttBqlmPNqNo7LkfnKS6VuA30CK8ueDdC/ANvzLJ8vtY RozIW6YdUSRzrAl7YRZLDkzzi8XJTBfPwJuo1NHZCg== X-Google-Smtp-Source: ACJfBos8bVGPC16PIsE1jcBptmkdQ8Q+o3as/QyDOONB+aUxhAWHEbCo6N0jdBfqQios9bHUWKKGCnaeg5OCWfb18iw= X-Received: by 10.200.27.225 with SMTP id m30mr37451787qtk.260.1514314257390; Tue, 26 Dec 2017 10:50:57 -0800 (PST) In-Reply-To: <834lodii55.fsf@gnu.org> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:400d:c0d::22f X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:221420 Archived-At: --94eb2c03457e63e1fa056142c39a Content-Type: text/plain; charset="UTF-8" Eli Zaretskii schrieb am Di., 26. Dez. 2017 um 17:11 Uhr: > > From: Philipp Stephani > > Date: Tue, 26 Dec 2017 10:35:42 +0000 > > Cc: emacs-devel@gnu.org, phst@google.com > > > > Suggest to move surrogates_to_codepoint to coding.c, and then use the > > macros UTF_16_HIGH_SURROGATE_P and UTF_16_LOW_SURROGATE_P defined > > there. > > > > Hmm, I'd rather go the other way round and remove these macros later. > They are macros, thus worse than > > functions, > > I don't think we have a policy to prefer inline functions to macros, > and I don't think we should have such a policy. We use inline > functions when that's necessary, but we don't in general prefer them. > They have their own problems, see the comments in lisp.h for some of > that. > Thanks, the only discussion I saw there was about some performance issues: Some operations are so commonly executed that they are implemented as macros, not functions, because otherwise runtime performance would suffer too much when compiling with GCC without optimization. FIXME: Remove the lisp_h_OP macros, and define just the inline OP functions, once "gcc -Og" (new to GCC 4.8) works well enough for Emacs developers. Maybe in the year 2020. See Bug#11935. That bug says that GCC 4.8 should be available in Debian stable to support -Og. Debian stable now already has 6.3 ( https://wiki.debian.org/DebianStretch#Packages_.26_versions). So maybe it's time to compile development versions with -Og and try to reapply the original patch. 5 years have passed, and compilers have gotten a lot better. In any case, the new functions are certainly not commonly executed. They are currently only executed when converting non-BMP keystrokes on macOS, which is rare enough. > > > and don't seem to be correct either (what about a value such as > 0x11DC00?). > > ??? They care correct for UTF-16 sequences, which are 16-bit numbers. > If you need to augment them by testing the high-order bits to be zero > in your case, that's okay, but I don't see any need for introducing > similar but different functionality. > I'd be OK with using the macros since they already exist, but I wouldn't want to touch them without converting them to functions first, and for using them in nsterm.m I'd have to move them around. > > > No new macros please if we can avoid it. Functions are strictly better. > > Sorry, I disagree. Each has its advantages, and on balance I find > macros to be slightly better, certainly not worse. There's no need to > avoid them in C. > I disagree, see e.g. https://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html and many other sources. Sometimes macros are unavoidable, but not here. > > > I don't care much whether they are in character.h or coding.h, but > char_surrogate_p is already in character.h. > > char_surrogate_p should have used the coding.h macros as well. > > > > + USE_SAFE_ALLOCA; > > > + unichar *utf16_buffer; > > > + SAFE_NALLOCA (utf16_buffer, 1, len); > > > > Maximum length of a UTF-16 sequence is known in advance, so why do you > > need SAFE_NALLOCA here? Couldn't you use a buffer of fixed length > > instead? > > > > The text being inserted can be arbitrarily long. Even single characters > (i.e. extended grapheme clusters) can > > be arbitrarily long. > > Yes, but why do you first copy the input into a separate buffer? Why > not convert each UTF-16 sequence separately, as you go through the > loop? > Message (method) invocations in Objective-C have high overhead because they are late-bound. Therefore it is advisable to minimize the number of messages sent. https://developer.apple.com/documentation/foundation/nsstring/1408720-getcharacters?language=objc also indicates that a (properly implemented) getCharacters call is faster than calling characterAtIndex in a loop. --94eb2c03457e63e1fa056142c39a Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


Eli Za= retskii <eliz@gnu.org> schrieb am= Di., 26. Dez. 2017 um 17:11=C2=A0Uhr:
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Tue, 26 Dec 2017 10:35:42 +0000
> Cc: emacs-dev= el@gnu.org, phst@g= oogle.com
>
>=C2=A0 Suggest to move surrogates_to_codepoint to coding.c, and then us= e the
>=C2=A0 macros UTF_16_HIGH_SURROGATE_P and UTF_16_LOW_SURROGATE_P define= d
>=C2=A0 there.
>
> Hmm, I'd rather go the other way round and remove these macros lat= er. They are macros, thus worse than
> functions,

I don't think we have a policy to prefer inline functions to macros, and I don't think we should have such a policy.=C2=A0 We use inline
functions when that's necessary, but we don't in general prefer the= m.
They have their own problems, see the comments in lisp.h for some of
that.

Thanks, the only discussion I saw= there was about some performance issues:

=C2= =A0 =C2=A0Some operations are so commonly executed that they are implemente= d
=C2=A0 =C2=A0as macros, not functions, because otherwise runtim= e performance would
=C2=A0 =C2=A0suffer too much when compiling w= ith GCC without optimization.

=C2=A0 = =C2=A0FIXME: Remove the lisp_h_OP macros, and define just the inline OP
=C2=A0 =C2=A0functions, once "gcc -Og" (new to GCC 4.8) wo= rks well enough for
=C2=A0 =C2=A0Emacs developers.=C2=A0 Maybe in= the year 2020.=C2=A0 See Bug#11935.

That bu= g says that GCC 4.8 should be available in Debian stable to support -Og. De= bian stable now already has 6.3 (https://wiki.debian.org/DebianStretch#Package= s_.26_versions). So maybe it's time to compile development versions= with -Og and try to reapply the original patch. 5 years have passed, and c= ompilers have gotten a lot better.

In any case, th= e new functions are certainly not commonly executed. They are currently onl= y executed when converting non-BMP keystrokes on macOS, which is rare enoug= h.

=C2=A0

> and don't seem to be correct either (what about a value such as 0x= 11DC00?).

??? They care correct for UTF-16 sequences, which are 16-bit numbers.
If you need to augment them by testing the high-order bits to be zero
in your case, that's okay, but I don't see any need for introducing=
similar but different functionality.

I&= #39;d be OK with using the macros since they already exist, but I wouldn= 9;t want to touch them without converting them to functions first, and for = using them in nsterm.m I'd have to move them around.
=C2=A0

> No new macros please if we can avoid it. Functions are strictly better= .

Sorry, I disagree.=C2=A0 Each has its advantages, and on balance I find
macros to be slightly better, certainly not worse.=C2=A0 There's no nee= d to
avoid them in C.

I disagree, see e.g.= =C2=A0ht= tps://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html=C2=A0and many othe= r sources.
Sometimes macros are unavoidable, but not here.
<= div>=C2=A0

> I don't care much whether they are in character.h or coding.h, but= char_surrogate_p is already in character.h.

char_surrogate_p should have used the coding.h macros as well.

>=C2=A0 > +=C2=A0 USE_SAFE_ALLOCA;
>=C2=A0 > +=C2=A0 unichar *utf16_buffer;
>=C2=A0 > +=C2=A0 SAFE_NALLOCA (utf16_buffer, 1, len);
>
>=C2=A0 Maximum length of a UTF-16 sequence is known in advance, so why = do you
>=C2=A0 need SAFE_NALLOCA here?=C2=A0 Couldn't you use a buffer of f= ixed length
>=C2=A0 instead?
>
> The text being inserted can be arbitrarily long. Even single character= s (i.e. extended grapheme clusters) can
> be arbitrarily long.

Yes, but why do you first copy the input into a separate buffer?=C2=A0 Why<= br> not convert each UTF-16 sequence separately, as you go through the
loop?

Message (method) invocations in O= bjective-C have high overhead because they are late-bound. Therefore it is = advisable to minimize the number of messages sent.=C2=A0https://developer.apple.com/documentation/foundation/nsst= ring/1408720-getcharacters?language=3Dobjc=C2=A0also indicates that a (= properly implemented) getCharacters call is faster than calling characterAt= Index in a loop.
--94eb2c03457e63e1fa056142c39a--