unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
@ 2017-06-17  6:28 Adam Niederer
  2017-06-17  7:18 ` Andreas Schwab
  2017-06-17  8:05 ` Eli Zaretskii
  0 siblings, 2 replies; 14+ messages in thread
From: Adam Niederer @ 2017-06-17  6:28 UTC (permalink / raw)
  To: 27403

Hello, I believe I've found an indentation issue. To reproduce, start
emacs, create a buffer in js-mode, paste in this code, and press C-x h
TAB to indent the buffer:

let x = /* 👍 */ { foo: 0
                   bar: 0 }

let x = /* ☺ */ { foo: 0
                  bar: 0 }

Both 25.2 and 26.0.50 add one extra space before "bar" in the first
snippet with U+1F44D THUMBS UP SIGN in the comment, whereas the
second snippet with U+263A WHITE SMILING FACE properly aligns "bar" with
"foo". This appears to happen whenever the character in the comment
needs a surrogate pair.

This issue also happens in python-mode:

"👍", {"a": 2,
       "b": 3}

"☺", {"a":2,
      "b":3}

Interestingly, pressing TAB with one's point on the second line of each
snippet to dedent the line yields a correct result for both symbols:

"👍", {"a": 2,
    "b": 3}

"☺", {"a":2,
    "b":3}

Just in case those Emoji don't make it through the mail properly, the
first snippet in each example contains U+1F44D THUMBS UP SIGN before
the map, and the second snippet contains U+263A WHITE SMILING FACE.

-Adam


In GNU Emacs 26.0.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.22.15)
of 2017-06-17 built on AdamsPC
Repository revision: 49c0ff29c2e0243ba35ec17e3e3af49369be43db
Windowing system distributor 'The X.Org Foundation', version 11.0.11903000
System Description: Arch Linux

Recent messages:
Auto-saving...
20 (#o24, #x14, ?\C-t)
21 (#o25, #x15, ?\C-u)
20 (#o24, #x14, ?\C-t) [2 times]
Undo! [3 times]
20 (#o24, #x14, ?\C-t)
Auto-saving...
mwheel-scroll: Beginning of buffer
Mark set
Auto-saving...done

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND GPM DBUS GCONF GSETTINGS
NOTIFY ACL GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 LIBSYSTEMD

Important settings:
value of $LC_COLLATE: en_US.UTF-8
value of $LANG: en_US.UTF-8
locale-coding-system: utf-8-unix

Major mode: JavaScript

Minor modes in effect:
tooltip-mode: t
global-eldoc-mode: t
electric-indent-mode: t
mouse-wheel-mode: t
tool-bar-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
blink-cursor-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
line-number-mode: t
transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr cl-extra help-fns radix-tree cl-seq help-mode
debug emacsbug message subr-x puny dired dired-loaddefs format-spec
rfc822 mml mml-sec password-cache epa derived epg epg-config gnus-util
rmail rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils js advice sgml-mode dom json map seq
byte-opt bytecomp byte-compile cconv imenu thingatpt cc-mode cc-fonts
easymenu cc-guess cc-menus cc-cmds cc-styles cc-align cc-engine cc-vars
cc-defs cl gv cl-loaddefs cl-lib time-date mule-util tooltip eldoc
electric uniquify ediff-hook vc-hooks lisp-float-type mwheel term/x-win
x-win term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow isearch timer select
scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang
vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932
hebrew greek romanian slovak czech european ethiopic indian cyrillic
chinese composite charscript charprop case-table epa-hook jka-cmpr-hook
help simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs
button faces cus-face macroexp files text-properties overlay sha1 md5
base64 format env code-pages mule custom widget hashtable-print-readable
backquote dbusbind inotify dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty
make-network-process emacs)

Memory information:
((conses 16 133992 34676)
(symbols 48 23645 1)
(miscs 40 95 584)
(strings 32 30245 2336)
(string-bytes 1 974865)
(vectors 16 20847)
(vector-slots 8 721119 47152)
(floats 8 53 405)
(intervals 56 801 49)
(buffers 976 14))







* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17  6:28 bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535 Adam Niederer
@ 2017-06-17  7:18 ` Andreas Schwab
  2017-06-17  8:05 ` Eli Zaretskii
  1 sibling, 0 replies; 14+ messages in thread
From: Andreas Schwab @ 2017-06-17  7:18 UTC (permalink / raw)
  To: Adam Niederer; +Cc: 27403

On Jun 17 2017, Adam Niederer <adam.niederer@gmail.com> wrote:

> let x = /* 👍 */ { foo: 0
>                    bar: 0 }

(char-width ?👍) => 2

> let x = /* ☺ */ { foo: 0
>                   bar: 0 }

(char-width ?☺) => 1

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17  6:28 bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535 Adam Niederer
  2017-06-17  7:18 ` Andreas Schwab
@ 2017-06-17  8:05 ` Eli Zaretskii
  2017-06-17  8:24   ` Andreas Schwab
  1 sibling, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2017-06-17  8:05 UTC (permalink / raw)
  To: Adam Niederer; +Cc: 27403

> From: Adam Niederer <adam.niederer@gmail.com>
> Date: Sat, 17 Jun 2017 02:28:41 -0400
> 
> Hello, I believe I've found an indentation issue. To reproduce, start
> emacs, create a buffer in js-mode, paste in this code, and press C-x h
> TAB to indent the buffer:
> 
> let x = /* 👍 */ { foo: 0
>                    bar: 0 }
> 
> let x = /* ☺ */ { foo: 0
>                   bar: 0 }
> 
> Both 25.2 and 26.0.50 add one extra space before "bar" in the first
> snippet with U+1F44D THUMBS UP SIGN in the comment, whereas the
> second snippet with U+263A WHITE SMILING FACE properly aligns "bar" with
> "foo".

That's because U+1F44D is a double-width character:

  (char-width ?👍) => 2

while U+263A is not double-width.

So as long as indentation works in columns and not in pixels, this is
a "feature".
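
To make the column arithmetic concrete, here is a minimal Python sketch
(not Emacs code) of what a column-based indenter sees.  It assumes that
characters whose East Asian Width property is W or F occupy two columns
and everything else occupies one; display_columns is a hypothetical
helper, and Emacs's own char-width table handles more cases (tabs,
zero-width characters) than this does.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Approximate the number of columns `text` occupies.

    Assumption: Wide (W) and Fullwidth (F) characters take two columns,
    every other character takes one. Combining and zero-width
    characters are ignored for simplicity.
    """
    columns = 0
    for ch in text:
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        columns += 2 if wide else 1
    return columns


# Text preceding "foo" on the first line of each js-mode snippet.
thumbs_prefix = "let x = /* \U0001F44D */ { "  # U+1F44D THUMBS UP SIGN
smiley_prefix = "let x = /* \u263A */ { "      # U+263A WHITE SMILING FACE

# Both prefixes contain the same number of characters...
print(len(thumbs_prefix), len(smiley_prefix))
# ...but they differ in columns, which is where "bar" gets aligned.
print(display_columns(thumbs_prefix), display_columns(smiley_prefix))
```

With a Unicode database of version 9.0 or later, the two prefixes have
the same character count but differ by one column, which matches the
one extra space in front of "bar".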

> This appears to happen whenever the character in the comment needs a
> surrogate pair.

I don't believe surrogates have anything to do with this, since Emacs
works with Unicode codepoints, not their UTF-16 encodings.
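
A quick way to check that the two properties are independent is to
compare them for a few characters.  The sketch below uses Python's
unicodedata module rather than Emacs; utf16_units and is_wide are
hypothetical helper names, and the two extra sample characters (U+4E2D
and U+1D400) are my own choice, not taken from the report.

```python
import unicodedata


def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for `ch` (2 means a surrogate pair)."""
    return len(ch.encode("utf-16-le")) // 2


def is_wide(ch: str) -> bool:
    """True if the East Asian Width property of `ch` is Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


samples = {
    "U+263A WHITE SMILING FACE": "\u263A",
    "U+4E2D CJK UNIFIED IDEOGRAPH": "\u4E2D",
    "U+1D400 MATHEMATICAL BOLD CAPITAL A": "\U0001D400",
    "U+1F44D THUMBS UP SIGN": "\U0001F44D",
}

for name, ch in samples.items():
    print(f"{name}: utf16_units={utf16_units(ch)} wide={is_wide(ch)}")
```

U+4E2D fits in a single UTF-16 code unit yet is wide, while U+1D400
needs a surrogate pair yet is not, so needing a surrogate pair does not
predict the extra column.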

> Interestingly, pressing TAB with one's point on the second line of each
> snippet to dedent the line yields a correct result for both symbols:
> 
> "👍", {"a": 2,
>     "b": 3}
> 
> "☺", {"a":2,
>     "b":3}

Which is probably a subtle bug: this should behave like the first
snippet.

Thanks.






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17  8:05 ` Eli Zaretskii
@ 2017-06-17  8:24   ` Andreas Schwab
  2017-06-17 10:28     ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Schwab @ 2017-06-17  8:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27403, Adam Niederer

On Jun 17 2017, Eli Zaretskii <eliz@gnu.org> wrote:

> That's because U+1F44D is a double-width character:
>
>   (char-width ?👍) => 2

The list in international/character.el is outdated.

> So as long as indentation works in columns and not in pixels, this is
> a "feature".

You surely don't want indentation to depend on font selection.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17  8:24   ` Andreas Schwab
@ 2017-06-17 10:28     ` Eli Zaretskii
  2017-06-17 12:09       ` Andreas Schwab
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2017-06-17 10:28 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: 27403, adam.niederer

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Adam Niederer <adam.niederer@gmail.com>,  27403@debbugs.gnu.org
> Date: Sat, 17 Jun 2017 10:24:41 +0200
> 
> On Jun 17 2017, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > That's because U+1F44D is a double-width character:
> >
> >   (char-width ?👍) => 2
> 
> The list in international/character.el is outdated.

I think the intent was to produce it from the Unicode data
(EastAsianWidth.txt).  I don't recall why this didn't happen; patches
are welcome.  Alternatively, synching the data with the latest Unicode
manually would be good as a stopgap.

> > So as long as indentation works in columns and not in pixels, this is
> > a "feature".
> 
> You surely don't want indentation to depend on font selection.

Patches for doing indentation in pixels are welcome, of course.






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17 10:28     ` Eli Zaretskii
@ 2017-06-17 12:09       ` Andreas Schwab
  2017-06-17 13:39         ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Schwab @ 2017-06-17 12:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27403, adam.niederer

On Jun 17 2017, Eli Zaretskii <eliz@gnu.org> wrote:

> I think the intent was to produce it from the Unicode data
> (EastAsianWidth.txt).  I don't recall why this didn't happen; patches
> are welcome.  Alternatively, synching the data with the latest Unicode
> manually would be good as a stopgap.

Actually, even Unicode 10 lists it as double width.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17 12:09       ` Andreas Schwab
@ 2017-06-17 13:39         ` Eli Zaretskii
  2017-06-17 18:07           ` Andreas Schwab
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2017-06-17 13:39 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: 27403, adam.niederer

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: adam.niederer@gmail.com,  27403@debbugs.gnu.org
> Date: Sat, 17 Jun 2017 14:09:44 +0200
> 
> On Jun 17 2017, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > I think the intent was to produce it from the Unicode data
> > (EastAsianWidth.txt).  I don't recall why this didn't happen; patches
> > are welcome.  Alternatively, synching the data with the latest Unicode
> > manually would be good as a stopgap.
> 
> Actually, even Unicode 10 lists it as double width.

OK, then why did you say the data was outdated?






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17 13:39         ` Eli Zaretskii
@ 2017-06-17 18:07           ` Andreas Schwab
  2017-06-17 18:21             ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Schwab @ 2017-06-17 18:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27403, adam.niederer

On Jun 17 2017, Eli Zaretskii <eliz@gnu.org> wrote:

> OK, then why did you say the data was outdated?

Because it was.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17 18:07           ` Andreas Schwab
@ 2017-06-17 18:21             ` Eli Zaretskii
  2022-02-03 20:25               ` Lars Ingebrigtsen
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2017-06-17 18:21 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: 27403, adam.niederer

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: adam.niederer@gmail.com,  27403@debbugs.gnu.org
> Date: Sat, 17 Jun 2017 20:07:56 +0200
> 
> On Jun 17 2017, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > OK, then why did you say the data was outdated?
> 
> Because it was.

Where's the up-to-date data we could use?






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2017-06-17 18:21             ` Eli Zaretskii
@ 2022-02-03 20:25               ` Lars Ingebrigtsen
  2022-02-04  7:05                 ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-03 20:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27403, Andreas Schwab, adam.niederer

Eli Zaretskii <eliz@gnu.org> writes:

>> > OK, then why did you say the data was outdated?
>> 
>> Because it was.
>
> Where's the up-to-date data we could use?

I don't know either -- the character is still wide in Unicode 14.

👍 (1f44d) is here:

1F442..1F4FC;W   # So   [187] EAR..VIDEOCASSETTE

So it's "W", which is "wide"...
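
For anyone who wants to repeat this lookup, here is a rough Python
sketch that parses a data line in the format quoted above and checks
whether a code point falls inside its range.  parse_eaw_line is a
hypothetical helper, and the regular expression is my assumption about
the line format based on the single line shown here.

```python
import re

# The line quoted above from the East Asian Width data file.
LINE = "1F442..1F4FC;W   # So   [187] EAR..VIDEOCASSETTE"


def parse_eaw_line(line: str):
    """Return (first, last, width_property) for one data line.

    Assumed format: "XXXX;P" or "XXXX..YYYY;P", optionally followed by
    a "#" comment.
    """
    data = line.split("#", 1)[0].strip()
    match = re.fullmatch(r"([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)", data)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    first = int(match.group(1), 16)
    last = int(match.group(2), 16) if match.group(2) else first
    return first, last, match.group(3)


first, last, width_property = parse_eaw_line(LINE)
thumbs_up = 0x1F44D

print(first <= thumbs_up <= last, width_property)
# Size of the range, for comparison with the "[187]" in the comment.
print(last - first + 1)
```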

The document that has the widths, though, refers to this:

https://www.unicode.org/reports/tr11/

---
This annex presents the specifications of a normative property for
Unicode characters that is useful when interoperating with East Asian
Legacy character sets.
---

So it's wide in the context of East Asian scripts, which isn't really
the primary usage of characters like emojis.

I tried googling around for a couple minutes to see whether Unicode has
made a data file that says something about typical character widths
outside of an East Asian context, and I can't find anything.

Uhm...  https://codepoints.net/U+1F44D?lang=en says it's neutral?
https://util.unicode.org/UnicodeJsps/character.jsp?a=1F44D says wide.

In the fonts I use, it's definitely wide.  But so is ☺, which is marked
as narrow.

So ❓

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2022-02-03 20:25               ` Lars Ingebrigtsen
@ 2022-02-04  7:05                 ` Eli Zaretskii
  2022-02-05  6:40                   ` Lars Ingebrigtsen
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2022-02-04  7:05 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 27403, schwab, adam.niederer

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: Andreas Schwab <schwab@linux-m68k.org>,  27403@debbugs.gnu.org,
>   adam.niederer@gmail.com
> Date: Thu, 03 Feb 2022 21:25:03 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> > OK, then why did you say the data was outdated?
> >> 
> >> Because it was.
> >
> > Where's the up-to-date data we could use?
> 
> I don't know either -- the character is still wide in Unicode 14.

I'm guessing Andreas meant the data of the other characters, not of
this one.

The current width data (about 4½ years later) is up-to-date with the latest
Unicode Standard version, at least AFAIK.  If someone finds a
mismatch, please point out specific discrepancies.

> In the fonts I use, it's definitely wide.  But so is ☺, which is marked
> as narrow.
> 
> So ❓

I don't see how this can be solved as long as indentation works in
columns.  If some font produces a glyph whose width isn't anywhere
close to the Unicode width specifications, what can we do except tell
people not to use those fonts?

Alternatively, if it turns out that most fonts use a different width, we
could amend our char-width table to be consistent with those fonts.

Finally, users who aren't happy with either solution could customize
the char-width table in their own Emacs; it's just a char-table that
can be updated.






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2022-02-04  7:05                 ` Eli Zaretskii
@ 2022-02-05  6:40                   ` Lars Ingebrigtsen
  2022-02-05  7:51                     ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-05  6:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27403, schwab, adam.niederer

Eli Zaretskii <eliz@gnu.org> writes:

>> In the fonts I use, it's definitely wide.  But so is ☺, which is marked
>> as narrow.
>> 
>> So ❓
>
> I don't see how this can be solved as long as indentation works in
> columns.  If some font produces a glyph whose width isn't anywhere
> close to the Unicode width specifications, what can we do except tell
> people not to use those fonts?
>
> Alternatively, if it turns out that most fonts use a different width, we
> could amend our char-width table to be consistent with those fonts.

Yes, it would be nice if this worked better out-of-the-box for most
people, but I wouldn't want to manually maintain a list of typical char
widths, either.

By the way, ☺ in the terminal here (Debian/bullseye) does take 1
character while 👍 takes two, so perhaps they're also using the same
Unicode data that we're using...

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2022-02-05  6:40                   ` Lars Ingebrigtsen
@ 2022-02-05  7:51                     ` Eli Zaretskii
  2022-02-05  7:55                       ` Lars Ingebrigtsen
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2022-02-05  7:51 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 27403, schwab, adam.niederer

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: schwab@linux-m68k.org,  27403@debbugs.gnu.org,  adam.niederer@gmail.com
> Date: Sat, 05 Feb 2022 07:40:10 +0100
> 
> By the way, ☺ in the terminal here (Debian/bullseye) does take 1
> character while 👍 takes two, so perhaps they're also using the same
> Unicode data that we're using...

Well-behaved terminal emulators indeed do use the same tables.






* bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535
  2022-02-05  7:51                     ` Eli Zaretskii
@ 2022-02-05  7:55                       ` Lars Ingebrigtsen
  0 siblings, 0 replies; 14+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-05  7:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 27403, schwab, adam.niederer

Eli Zaretskii <eliz@gnu.org> writes:

>> By the way, ☺ in the terminal here (Debian/bullseye) does take 1
>> character while 👍 takes two, so perhaps they're also using the same
>> Unicode data that we're using...
>
> Well-behaved terminal emulators indeed do use the same tables.

So the test code in question indents "properly" in emacs -nw (at least
with this terminal).

I guess there's not really anything further we can do on the Emacs side
here: when source code contains characters that are rendered from several
fonts, the indentation will look different for different people on
different systems, and there isn't much we can do about that.  I'm
therefore closing this bug report.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





