unofficial mirror of bug-gnu-emacs@gnu.org 
From: Rich Felker <dalias@aerifal.cx>
Subject: BUG: Emacs ignores charcell width when running on terminal (w/rtfs & ideas for fix)
Date: Wed, 11 Oct 2006 15:16:50 -0400
Message-ID: <20061011191650.GA13329@brightrain.aerifal.cx>

[-- Attachment #1: Type: text/plain, Size: 3011 bytes --]

When GNU Emacs is run on a terminal (-nw mode) and editing UTF-8 text
files, it treats all characters as if they occupy one character cell
column on the terminal. This causes it to become confused about the
cursor position whenever the text contains CJK fullwidth characters or
scripts that use nonspacing combining characters, to the point that
editing is impossible.

My coding system settings:
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(prefer-coding-system 'utf-8)

I run emacs inside GNU screen, running on a real UTF-8 terminal, but
if you don't have a real UTF-8 terminal, screen can emulate UTF-8
(showing ? for unavailable width-1 characters and ?? for unavailable
width-2 characters) on any terminal. Using a UTF-8 xterm or other
terminal that supports UTF-8 may make it easier to see the problem
though.

Attached to this email is a UTF-8 file you can open in Emacs which
exhibits the problem: Japanese Hiragana (for CJK wide) and Tibetan and
Thai (for nonspacing).

The root of the problem: in the produce_glyphs() function in term.c,
the code assumes all multibyte characters for a given 'charset' have
the same width:

      /* A multi-byte character.  The display width is fixed for all
	 characters of the set.  Some of the glyphs may have to be
	 ignored because they are already displayed in a continued
	 line.  */
      int charset = CHAR_CHARSET (it->c);
      it->pixel_width = CHARSET_WIDTH (charset);

I put together a horrible elaborate hack to work around this:

      struct glyph glyph = { .type = CHAR_GLYPH, .u = { .ch = it->c } };
      char *foo = encode_terminal_code (&glyph, 1, &terminal_coding);
      wchar_t wc = dec_utf8(foo); /* naive utf8 decode function */
      it->pixel_width = mk_wcwidth(wc); /* Kuhn's UCS wcwidth func */

But it's incorrect: it assumes the terminal encoding is UTF-8, not to
mention that it's quite inefficient and ugly. (Note: for term.c,
"pixel" means character cell.)
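
For anyone who wants to check the expected column counts outside of
Emacs, here is a minimal standalone sketch of the same idea: decode
UTF-8 naively, then look up a cell width per code point. This is not
the code from my hack. The names utf8_decode, cell_width and
utf8_columns are made up for illustration, and the range tables are a
small hand-picked subset that only covers the scripts in the attached
example. A real table (such as the one in Kuhn's wcwidth
implementation) has many more entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint32_t first, last; };

/* Illustrative subset only: nonspacing marks for Latin, Thai, Tibetan. */
static const struct range zero_width[] = {
  { 0x0300, 0x036F },  /* combining diacritical marks */
  { 0x0E31, 0x0E31 },  /* Thai mai han-akat */
  { 0x0E34, 0x0E3A },  /* Thai vowel signs above/below */
  { 0x0E47, 0x0E4E },  /* Thai tone marks and signs */
  { 0x0F71, 0x0F7E },  /* Tibetan vowel signs */
  { 0x0F80, 0x0F84 },
  { 0x0F90, 0x0F97 },  /* Tibetan subjoined consonants */
  { 0x0F99, 0x0FBC },
};

/* Illustrative subset only: characters that take two cells. */
static const struct range wide[] = {
  { 0x1100, 0x115F },  /* Hangul Jamo initial consonants */
  { 0x2E80, 0x303E },  /* CJK radicals, symbols, punctuation */
  { 0x3040, 0xA4CF },  /* kana, CJK ideographs, Yi */
  { 0xAC00, 0xD7A3 },  /* Hangul syllables */
  { 0xF900, 0xFAFF },  /* CJK compatibility ideographs */
  { 0xFF00, 0xFF60 },  /* fullwidth forms */
};

static int
in_ranges (uint32_t c, const struct range *t, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (c >= t[i].first && c <= t[i].last)
      return 1;
  return 0;
}

/* Cell width of one code point: 0, 1 or 2.  Unlike wcwidth(), control
   characters are not special-cased and count as 1.  */
static int
cell_width (uint32_t c)
{
  if (in_ranges (c, zero_width, sizeof zero_width / sizeof zero_width[0]))
    return 0;
  if (in_ranges (c, wide, sizeof wide / sizeof wide[0]))
    return 2;
  return 1;
}

/* Naive UTF-8 decode.  Stores the code point in *out and returns the
   number of bytes consumed.  Malformed input yields U+FFFD and consumes
   one byte.  Overlong forms and surrogates are not rejected.  */
static size_t
utf8_decode (const unsigned char *s, uint32_t *out)
{
  uint32_t c = s[0];
  size_t extra;

  if (c < 0x80) extra = 0;
  else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }
  else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
  else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }
  else { *out = 0xFFFD; return 1; }

  for (size_t i = 1; i <= extra; i++)
    {
      /* A NUL terminator fails this test, so we never read past it.  */
      if ((s[i] & 0xC0) != 0x80) { *out = 0xFFFD; return 1; }
      c = (c << 6) | (s[i] & 0x3F);
    }
  *out = c;
  return extra + 1;
}

/* Number of terminal columns a NUL-terminated UTF-8 string occupies.  */
static int
utf8_columns (const char *text)
{
  assert (text != NULL);
  const unsigned char *s = (const unsigned char *) text;
  int columns = 0;

  while (*s)
    {
      uint32_t c;
      s += utf8_decode (s, &c);
      columns += cell_width (c);
    }
  return columns;
}
```

Under these tables, a Hiragana character such as U+306B counts as two
columns, while the Tibetan vowel sign U+0F7C and the Thai vowel sign
U+0E34 count as zero.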

With this change made, CJK characters are correctly treated as two
columns and combining marks as zero. However, combining marks then
disappear _entirely_, because the loop in append_glyph() (term.c)
never executes when width==0.

Correctly fixing the issue:

1. Needs some sort of width lookup for Unicode characters without
   having to convert from Emacs' native encoding to UCS through UTF-8.
   This should be straightforward for someone who understands the
   code.

2. The append_glyph() function needs to handle the width==0 case,
   perhaps converting the previous glyph into a COMPOSITE_GLYPH
   instead of adding a CHAR_GLYPH. However, I don't understand the
   COMPOSITE_GLYPH system in Emacs, so I don't know if this is
   feasible.
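
To make the second point concrete, here is a rough standalone model of
the behaviour I have in mind. It does not use Emacs' struct glyph or
its real append_glyph(). The struct cell and struct row types and both
functions are hypothetical, and the width is simply passed in by the
caller. append_naive() emits one cell per column, so a width of 0 emits
nothing. append_composing() instead attaches a zero-width character to
the nearest preceding base cell, which is roughly what turning the
previous glyph into a composite would have to achieve.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CELLS 80
#define MAX_MARKS 4

/* Hypothetical model of one terminal cell; not Emacs' struct glyph.  */
struct cell
{
  uint32_t base;              /* spacing character shown in this cell */
  uint32_t marks[MAX_MARKS];  /* zero-width characters attached to it */
  int nmarks;
  int padding;                /* nonzero for the 2nd column of a wide char */
};

/* A row must be zero-initialized before use, e.g. struct row r = {0};  */
struct row
{
  struct cell cells[MAX_CELLS];
  int used;
};

/* One cell per column.  With width == 0 the loop body never runs, so
   the character is silently dropped.  */
static void
append_naive (struct row *r, uint32_t c, int width)
{
  for (int i = 0; i < width && r->used < MAX_CELLS; i++)
    {
      struct cell *cell = &r->cells[r->used++];
      cell->base = c;
      cell->nmarks = 0;
      cell->padding = (i > 0);
    }
}

/* Same as above, except that a zero-width character is attached to the
   previous base cell.  A mark with no preceding base, or one beyond
   MAX_MARKS, is still dropped in this sketch.  */
static void
append_composing (struct row *r, uint32_t c, int width)
{
  assert (width >= 0);
  if (width == 0)
    {
      int i = r->used - 1;
      /* Skip back over the padding column(s) of a wide character.  */
      while (i > 0 && r->cells[i].padding)
        i--;
      if (i >= 0 && r->cells[i].nmarks < MAX_MARKS)
        r->cells[i].marks[r->cells[i].nmarks++] = c;
      return;
    }
  append_naive (r, c, width);
}
```

Appending U+0F56 (width 1) followed by U+0F7C (width 0) with
append_naive() leaves a single cell and loses the vowel sign. With
append_composing() the row still has a single cell, but the vowel sign
is kept in its marks list, so the cursor column stays correct and the
mark can still be sent to the terminal.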

At present this issue is making it very difficult for me to use
Tibetan text in composing email and material for the web, so I'm
looking for some way to fix it, either upstream or with hacks I can
make locally for the time being until it's fixed properly.

Rich



[-- Attachment #2: example.txt --]
[-- Type: text/plain, Size: 101 bytes --]

にほんご
བོད་སྐད་
ภาษาไทย: กิน กิน กิน กิน
