From: Evgeny Zajcev
Newsgroups: gmane.emacs.devel
Subject: Re: emojis and other multi-character glyphs
Date: Sun, 26 Dec 2021 13:41:21 +0300
References: <83lf07pt8i.fsf@gnu.org>
In-Reply-To: <83lf07pt8i.fsf@gnu.org>
To: Eli Zaretskii
Cc: emacs-devel
Xref: news.gmane.io gmane.emacs.devel:283279

On Sun, 26 Dec 2021 at 13:15, Eli Zaretskii <eliz@gnu.org> wrote:

> > From: Evgeny Zajcev <lg.zevlg@gmail.com>
> > Date: Sun, 26 Dec 2021 12:43:34 +0300
> >
> > There is some inconsistency in naming and behaviour in Emacs master.
> > We have `forward-char', `backward-char', `delete-char', `backward-delete-char' commands.  All of them use
> > "char" in their names, however, `forward-char' and `backward-char' treats "char" differently than
> > `delete-char' and `backward-delete-char'.
> >
> > Let me explain.  Emacs has support for composed characters to display multiple characters composed into
> > a single glyph.  Almost the same is done for multi-character emojis such as 🇷🇺 or 👨‍👩‍👧‍👦 - multiple
> > unicode chars are composed into single glyph representing some emoji.  Now, if you put point under
> > composed character or emoji and run `forward-char' or `backward-char' it moves point to the whole glyph,
> > however, if you run `delete-char' (when point is under composed char) or `backward-delete-char'(when
> > point just after the glyph) it will delete only single character from multiple character representation, so
> > pressing `C-d' under 🇷🇺 will magically turn Russian flag into 🇺.  This is very misleading behaviour
> > especially when invisible characters are used in the emojis

> Emacs had in the past a feature whereby the user could move and delete
> by single codepoints in composed character sequences.  This feature
> was somehow lost.  I'm trying for some time to determine how and why
> it was lost, and how to restore it.  So this issue is known and is in
> the works, albeit slowly.

Ah, I see, nice. I'll try to debug this as well to help you.
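
To make the flag example above concrete, here is a minimal sketch of what happens at the code point level. It is plain Python rather than Emacs Lisp and says nothing about how Emacs implements `delete-char'; the variable names are only for illustration.

```python
import unicodedata

# The Russian flag is two regional-indicator code points that a font
# composes into a single glyph.
flag = "\U0001F1F7\U0001F1FA"

for ch in flag:
    # Print each code point with its Unicode name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Deleting one code point from the front (the code-point-wise behaviour
# described for `C-d') leaves a lone regional indicator behind.
after_delete = flag[1:]
print(len(flag), len(after_delete))
```

So the "magic" is simply that one of the two code points behind the flag glyph is removed and the other one stays.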


> > Maybe introduce "glyph" term meaning graphical representation of chars sequence, displayed in the buffer
> > and operated as a whole thing?

> There's no need for that, because we can provide dwim-ish operation
> for existing commands without any new terminology or new commands.

Yeah, if "char" consistency is restored, then there is no need to introduce "glyph". I just thought it was a new feature that chars and glyphs are treated differently.

> > And also it will be possible to write something like `string-glyph-length' to return 1 for "👨‍👩‍👧‍👦" instead of 7
> > as `length' returns now.

> Why would that be useful?

Sometimes it is useful to know the real string length before acting on it. In my case, I use a service that limits the number of chars it can act on, and emojis are counted as a single char. Anyway, having something like an `emoji' text property (analogous to the `composition' text property for composed chars) would be very useful for different use cases.
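
To illustrate the kind of counting I mean, here is a rough sketch in Python (not Emacs Lisp). The function name `glyph_length` is hypothetical, and the rules are a deliberate simplification of Unicode grapheme clustering: it only handles ZWJ sequences, combining marks and variation selectors, skin-tone modifiers, and pairs of regional indicators. A real implementation should follow the full segmentation rules.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, glues emoji into one sequence


def _is_regional_indicator(ch: str) -> bool:
    # Regional indicators pair up to form flags.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def _extends_previous(ch: str) -> bool:
    # ZWJ, combining marks and variation selectors (Mn/Me/Mc),
    # and emoji skin-tone modifiers attach to the preceding code point.
    return (
        ch == ZWJ
        or unicodedata.category(ch) in ("Mn", "Me", "Mc")
        or 0x1F3FB <= ord(ch) <= 0x1F3FF
    )


def glyph_length(text: str) -> int:
    """Count user-visible glyphs under the simplified rules above."""
    count = 0
    prev = ""    # previous code point
    ri_run = 0   # length of the current run of regional indicators
    for ch in text:
        if _is_regional_indicator(ch):
            ri_run += 1
            if ri_run % 2 == 1:  # first half of a flag starts a glyph
                count += 1
        else:
            ri_run = 0
            joined = _extends_previous(ch) or prev == ZWJ
            if not joined or count == 0:
                count += 1
        prev = ch
    return count


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(len(family), glyph_length(family))
```

With this, the family emoji is 7 code points long but counts as one glyph, which is the number the service I mentioned would charge for. One known shortcut: anything following a ZWJ is treated as joined, which is broader than what the Unicode rules actually allow.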

-- 
lg