From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Philipp Stephani <p.stephani2@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] Allow inserting non-BMP characters
Date: Tue, 26 Dec 2017 18:50:46 +0000
Message-ID: <CAArVCkRdDWHv2uiB4nmrXAv6L9a76iimCde+dYrb1LwQnwh76Q@mail.gmail.com>
References: <CAArVCkRx8p_vaFKJ_kXRuoZCKVBSYr=94RJANGpU0NXvkEZv6A@mail.gmail.com>
	<20171225210115.13789-1-phst@google.com> <83d132hz9e.fsf@gnu.org>
	<CAArVCkSMeQcjxz0CCsjaOU55e7g=AwsE+dU9LDCajye6JzujeA@mail.gmail.com>
	<834lodii55.fsf@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="94eb2c03457e63e1fa056142c39a"
X-Trace: blaine.gmane.org 1514314195 7375 195.159.176.226 (26 Dec 2017 18:49:55 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Tue, 26 Dec 2017 18:49:55 +0000 (UTC)
Cc: phst@google.com, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Dec 26 19:49:50 2017
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1eTuIO-0001XP-32
	for ged-emacs-devel@m.gmane.org; Tue, 26 Dec 2017 19:49:48 +0100
Original-Received: from localhost ([::1]:52604 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1eTuKM-0004oH-Pj
	for ged-emacs-devel@m.gmane.org; Tue, 26 Dec 2017 13:51:50 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:57518)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1eTuJa-0004mK-NH
	for emacs-devel@gnu.org; Tue, 26 Dec 2017 13:51:04 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1eTuJY-00038C-Ou
	for emacs-devel@gnu.org; Tue, 26 Dec 2017 13:51:02 -0500
Original-Received: from mail-qt0-x22f.google.com ([2607:f8b0:400d:c0d::22f]:45578)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <p.stephani2@gmail.com>)
	id 1eTuJW-00035n-Dq; Tue, 26 Dec 2017 13:50:58 -0500
Original-Received: by mail-qt0-x22f.google.com with SMTP id g10so45704373qtj.12;
	Tue, 26 Dec 2017 10:50:58 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to
	:cc; bh=rJ0srZN3tfAwMYIvXL7N19PdE8LKLXLGw5e14vBMO0o=;
	b=WRv5H205BREuP4bMBjo5jkiqkMi/xycqJ2Uhp8XAUQ/3205iDE0xdBQ/A7aDhSPVNj
	lRgd4OMVvFhYsa8HgPx2RWldRnZ+9AlER+0G8ro1QInscpsFgyabcWN+Mkt/dYGuD6vo
	QuGl/Os7in0bbk5E29Nr9bLdJ4BjbMtvqsQXI/gx6ZMhslxYJfclfFnXv41BLqytI5hQ
	JYjrtMLTqX95UmL5QTPdpN0Xktq8HbXhBjmxcHmbtqdN3aNFGK5W/T4n/SlZMRDTcYHj
	g+DLABJle83n67zeRcj7vMTXvjCO8psqhASzvcHvoEbMNue/04vEG3CxrMIY2RFDq9CS
	NKpQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:mime-version:references:in-reply-to:from:date
	:message-id:subject:to:cc;
	bh=rJ0srZN3tfAwMYIvXL7N19PdE8LKLXLGw5e14vBMO0o=;
	b=AZSZiwrlFjG6PYwQH6+tMnFZAjfmJdNqIpMRqWX+3qknOsB5kRNm6+X4qAJ7Do5rBa
	MOXlhKfXjzDLZHpQJxOLXUsTzthGBmTVKKx/FZNvpHHz5Kz0VQdjGJHNQApefGoIEK2c
	iMyHmBsHwesEYAeWw1kNVGXDM6qfRH46z55zrrbcFgoNOIvi0ZmVunzQTkmtMK3jpSdg
	UTleqXyyJaFRSYhHQlmu59O06uc/QM4E6jbr2S/nyPUEqxxNhHq9IyoVmjv0ilQ0wZSW
	HookR8TE9iMUOGcP0HrqfX3fHwd/peS0gvWjPjTF80YhdmBRvwR+23s3IA3M3OiY+sBJ
	8Ikg==
X-Gm-Message-State: AKGB3mJYDaXqHttBqlmPNqNo7LkfnKS6VuA30CK8ueDdC/ANvzLJ8vtY
	RozIW6YdUSRzrAl7YRZLDkzzi8XJTBfPwJuo1NHZCg==
X-Google-Smtp-Source: ACJfBos8bVGPC16PIsE1jcBptmkdQ8Q+o3as/QyDOONB+aUxhAWHEbCo6N0jdBfqQios9bHUWKKGCnaeg5OCWfb18iw=
X-Received: by 10.200.27.225 with SMTP id m30mr37451787qtk.260.1514314257390; 
	Tue, 26 Dec 2017 10:50:57 -0800 (PST)
In-Reply-To: <834lodii55.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 2607:f8b0:400d:c0d::22f
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:221420
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/221420>

--94eb2c03457e63e1fa056142c39a
Content-Type: text/plain; charset="UTF-8"

Eli Zaretskii <eliz@gnu.org> schrieb am Di., 26. Dez. 2017 um 17:11 Uhr:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Tue, 26 Dec 2017 10:35:42 +0000
> > Cc: emacs-devel@gnu.org, phst@google.com
> >
> >  Suggest to move surrogates_to_codepoint to coding.c, and then use the
> >  macros UTF_16_HIGH_SURROGATE_P and UTF_16_LOW_SURROGATE_P defined
> >  there.
> >
> > Hmm, I'd rather go the other way round and remove these macros later.
> They are macros, thus worse than
> > functions,
>
> I don't think we have a policy to prefer inline functions to macros,
> and I don't think we should have such a policy.  We use inline
> functions when that's necessary, but we don't in general prefer them.
> They have their own problems, see the comments in lisp.h for some of
> that.
>

Thanks, the only discussion I saw there was about some performance issues:

   Some operations are so commonly executed that they are implemented
   as macros, not functions, because otherwise runtime performance would
   suffer too much when compiling with GCC without optimization.

   FIXME: Remove the lisp_h_OP macros, and define just the inline OP
   functions, once "gcc -Og" (new to GCC 4.8) works well enough for
   Emacs developers.  Maybe in the year 2020.  See Bug#11935.

That bug says that GCC 4.8 should be available in Debian stable to support
-Og. Debian stable now already has 6.3 (
https://wiki.debian.org/DebianStretch#Packages_.26_versions). So maybe it's
time to compile development versions with -Og and try to reapply the
original patch. 5 years have passed, and compilers have gotten a lot better.

In any case, the new functions are certainly not commonly executed. They
are currently only executed when converting non-BMP keystrokes on macOS,
which is rare enough.


>
> > and don't seem to be correct either (what about a value such as
> 0x11DC00?).
>
> ??? They care correct for UTF-16 sequences, which are 16-bit numbers.
> If you need to augment them by testing the high-order bits to be zero
> in your case, that's okay, but I don't see any need for introducing
> similar but different functionality.
>

I'd be OK with using the macros since they already exist, but I wouldn't
want to touch them without converting them to functions first, and for
using them in nsterm.m I'd have to move them around.


>
> > No new macros please if we can avoid it. Functions are strictly better.
>
> Sorry, I disagree.  Each has its advantages, and on balance I find
> macros to be slightly better, certainly not worse.  There's no need to
> avoid them in C.
>

I disagree, see e.g. https://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html and
many other sources.
Sometimes macros are unavoidable, but not here.


>
> > I don't care much whether they are in character.h or coding.h, but
> char_surrogate_p is already in character.h.
>
> char_surrogate_p should have used the coding.h macros as well.
>
> >  > +  USE_SAFE_ALLOCA;
> >  > +  unichar *utf16_buffer;
> >  > +  SAFE_NALLOCA (utf16_buffer, 1, len);
> >
> >  Maximum length of a UTF-16 sequence is known in advance, so why do you
> >  need SAFE_NALLOCA here?  Couldn't you use a buffer of fixed length
> >  instead?
> >
> > The text being inserted can be arbitrarily long. Even single characters
> (i.e. extended grapheme clusters) can
> > be arbitrarily long.
>
> Yes, but why do you first copy the input into a separate buffer?  Why
> not convert each UTF-16 sequence separately, as you go through the
> loop?
>

Message (method) invocations in Objective-C have high overhead because they
are late-bound. Therefore it is advisable to minimize the number of
messages sent.
https://developer.apple.com/documentation/foundation/nsstring/1408720-getcharacters?language=objc
also
indicates that a (properly implemented) getCharacters call is faster than
calling characterAtIndex in a loop.

--94eb2c03457e63e1fa056142c39a
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><br><br><div class=3D"gmail_quote"><div dir=3D"ltr">Eli Za=
retskii &lt;<a href=3D"mailto:eliz@gnu.org">eliz@gnu.org</a>&gt; schrieb am=
 Di., 26. Dez. 2017 um 17:11=C2=A0Uhr:<br></div><blockquote class=3D"gmail_=
quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1=
ex">&gt; From: Philipp Stephani &lt;<a href=3D"mailto:p.stephani2@gmail.com=
" target=3D"_blank">p.stephani2@gmail.com</a>&gt;<br>
&gt; Date: Tue, 26 Dec 2017 10:35:42 +0000<br>
&gt; Cc: <a href=3D"mailto:emacs-devel@gnu.org" target=3D"_blank">emacs-dev=
el@gnu.org</a>, <a href=3D"mailto:phst@google.com" target=3D"_blank">phst@g=
oogle.com</a><br>
&gt;<br>
&gt;=C2=A0 Suggest to move surrogates_to_codepoint to coding.c, and then us=
e the<br>
&gt;=C2=A0 macros UTF_16_HIGH_SURROGATE_P and UTF_16_LOW_SURROGATE_P define=
d<br>
&gt;=C2=A0 there.<br>
&gt;<br>
&gt; Hmm, I&#39;d rather go the other way round and remove these macros lat=
er. They are macros, thus worse than<br>
&gt; functions,<br>
<br>
I don&#39;t think we have a policy to prefer inline functions to macros,<br=
>
and I don&#39;t think we should have such a policy.=C2=A0 We use inline<br>
functions when that&#39;s necessary, but we don&#39;t in general prefer the=
m.<br>
They have their own problems, see the comments in lisp.h for some of<br>
that.<br></blockquote><div><br></div><div>Thanks, the only discussion I saw=
 there was about some performance issues:</div><div><br></div><div><div>=C2=
=A0 =C2=A0Some operations are so commonly executed that they are implemente=
d</div><div>=C2=A0 =C2=A0as macros, not functions, because otherwise runtim=
e performance would</div><div>=C2=A0 =C2=A0suffer too much when compiling w=
ith GCC without optimization.</div></div><div><br></div><div><div>=C2=A0 =
=C2=A0FIXME: Remove the lisp_h_OP macros, and define just the inline OP</di=
v><div>=C2=A0 =C2=A0functions, once &quot;gcc -Og&quot; (new to GCC 4.8) wo=
rks well enough for</div><div>=C2=A0 =C2=A0Emacs developers.=C2=A0 Maybe in=
 the year 2020.=C2=A0 See Bug#11935.</div></div><div><br></div><div>That bu=
g says that GCC 4.8 should be available in Debian stable to support -Og. De=
bian stable now already has 6.3 (<a href=3D"https://wiki.debian.org/DebianS=
tretch#Packages_.26_versions">https://wiki.debian.org/DebianStretch#Package=
s_.26_versions</a>). So maybe it&#39;s time to compile development versions=
 with -Og and try to reapply the original patch. 5 years have passed, and c=
ompilers have gotten a lot better.</div><div><br></div><div>In any case, th=
e new functions are certainly not commonly executed. They are currently onl=
y executed when converting non-BMP keystrokes on macOS, which is rare enoug=
h.</div><div><br></div><div>=C2=A0</div><blockquote class=3D"gmail_quote" s=
tyle=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
&gt; and don&#39;t seem to be correct either (what about a value such as 0x=
11DC00?).<br>
<br>
??? They care correct for UTF-16 sequences, which are 16-bit numbers.<br>
If you need to augment them by testing the high-order bits to be zero<br>
in your case, that&#39;s okay, but I don&#39;t see any need for introducing=
<br>
similar but different functionality.<br></blockquote><div><br></div><div>I&=
#39;d be OK with using the macros since they already exist, but I wouldn=
9;t want to touch them without converting them to functions first, and for =
using them in nsterm.m I&#39;d have to move them around.</div><div>=C2=A0</=
div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-lef=
t:1px #ccc solid;padding-left:1ex">
<br>
&gt; No new macros please if we can avoid it. Functions are strictly better=
.<br>
<br>
Sorry, I disagree.=C2=A0 Each has its advantages, and on balance I find<br>
macros to be slightly better, certainly not worse.=C2=A0 There&#39;s no nee=
d to<br>
avoid them in C.<br></blockquote><div><br></div><div>I disagree, see e.g.=
=C2=A0<a href=3D"https://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html">ht=
tps://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html</a>=C2=A0and many othe=
r sources.</div><div>Sometimes macros are unavoidable, but not here.</div><=
div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8e=
x;border-left:1px #ccc solid;padding-left:1ex">
<br>
&gt; I don&#39;t care much whether they are in character.h or coding.h, but=
 char_surrogate_p is already in character.h.<br>
<br>
char_surrogate_p should have used the coding.h macros as well.<br>
<br>
&gt;=C2=A0 &gt; +=C2=A0 USE_SAFE_ALLOCA;<br>
&gt;=C2=A0 &gt; +=C2=A0 unichar *utf16_buffer;<br>
&gt;=C2=A0 &gt; +=C2=A0 SAFE_NALLOCA (utf16_buffer, 1, len);<br>
&gt;<br>
&gt;=C2=A0 Maximum length of a UTF-16 sequence is known in advance, so why =
do you<br>
&gt;=C2=A0 need SAFE_NALLOCA here?=C2=A0 Couldn&#39;t you use a buffer of f=
ixed length<br>
&gt;=C2=A0 instead?<br>
&gt;<br>
&gt; The text being inserted can be arbitrarily long. Even single character=
s (i.e. extended grapheme clusters) can<br>
&gt; be arbitrarily long.<br>
<br>
Yes, but why do you first copy the input into a separate buffer?=C2=A0 Why<=
br>
not convert each UTF-16 sequence separately, as you go through the<br>
loop?<br></blockquote><div><br></div><div>Message (method) invocations in O=
bjective-C have high overhead because they are late-bound. Therefore it is =
advisable to minimize the number of messages sent.=C2=A0<a href=3D"https://=
developer.apple.com/documentation/foundation/nsstring/1408720-getcharacters=
?language=3Dobjc">https://developer.apple.com/documentation/foundation/nsst=
ring/1408720-getcharacters?language=3Dobjc</a>=C2=A0also indicates that a (=
properly implemented) getCharacters call is faster than calling characterAt=
Index in a loop.</div></div></div>

--94eb2c03457e63e1fa056142c39a--