unofficial mirror of help-gnu-emacs@gnu.org
From: Peter Dyballa <Peter_Dyballa@Web.DE>
Cc: help-gnu-emacs@gnu.org
Subject: Re: UTF-8 in path / filename
Date: Sat, 26 Aug 2006 11:36:34 +0200
Message-ID: <0C15C504-B711-403E-B8D1-F03234C453E3@Web.DE>
In-Reply-To: <87odu8ct0a.fsf@catnip.gol.com>


On 26.08.2006 at 01:09, Miles Bader wrote:

> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>> There won't be a perfect solution with GNU Emacs in the near  
>> future ...
>
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using  
> emacs 22).
>
> [It works perfectly even if I do `emacs -Q' to avoid loading my init
> file, though I normally use (set-language-environment 'japanese).]
>
> AFAIK the main thing is that your LANG environment variable be set to
> something mentioning utf-8 -- I use "ja_JP.UTF-8".
>

	pete 39 /\ .
	/Users/pete
	pete 40 /\ env | egrep -i 'LC|LANG'
	LANG=de_DE.UTF-8
	LC_CTYPE=de_DE.UTF-8
	pete 41 /\  /usr/local/bin/emacs-22.0.50 -Q &

Files with UTF-8 characters in their names are shown in dired (which has
-u: in the mode line, i.e. uses UTF-8) as <vowel><empty box>. Some UTF-8
characters, like ß or €, show up as themselves. The names appear the same
way in the buffer's mode line once the file is visited, and in the list
of buffers (C-x b). They are completely unreadable in the Buffers menu of
the menu bar, and unreadable in yet another way in the "Buffer Menu"
pop-up. The font used for the vowels, the empty boxes, and the other
characters is taken from the Java SDK and is quite rich (1425 mapped
characters for mostly European and some Near Eastern scripts):

      -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x61)
      -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x308)
      -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#xDF)
      -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x20AC)

Somehow this looks like a mixture of ISO 8859 characters (#x61, #xDF),
Unicode (#x20AC), and something else (#x308). Or are some of these
representations just abbreviated, with the leading zeros left out?
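
The question above can be checked directly: padded to four hex digits, all four values are valid Unicode code points. A minimal sketch, assuming Python and its standard unicodedata module (neither is used in the original message; the helper name describe is hypothetical):

```python
import unicodedata

# Code points as reported in the font lines above, leading zeros omitted.
REPORTED = [0x61, 0x308, 0xDF, 0x20AC]

def describe(code_point):
    """Return (U+XXXX notation, Unicode character name) for a code point."""
    return (f"U+{code_point:04X}", unicodedata.name(chr(code_point)))

for cp in REPORTED:
    print(*describe(cp))
```

U+0061 and U+00DF only look like ISO 8859-1 values because the first 256 Unicode code points coincide with Latin-1.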

The other information from C-u C-x = on the examples is:

  character: a (97, #o141, #x61, U+0061)
    charset: ascii (ASCII (ISO646 IRV))
 code point: #x61
     syntax: w 	which means: word
   category: a:ASCII l:Latin
buffer code: #x61
  file code: #x61 (encoded by coding system mule-utf-8)

  character:  (332488, #o1211310, #x512c8, U+0308)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x25 #x48
     syntax: w 	which means: word
   category: ^:Combining diacritic or mark
buffer code: #x9C #xF4 #xA5 #xC8
  file code: #xCC #x88 (encoded by coding system mule-utf-8)

  character: ß (2271, #o4337, #x8df, U+00DF)
    charset: latin-iso8859-1 (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
 code point: #x5F
     syntax: w 	which means: word
   category: l:Latin
buffer code: #x81 #xDF
  file code: #xC3 #x9F (encoded by coding system mule-utf-8)

  character: € (342604, #o1235114, #x53a4c, U+20AC)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x74 #x4C
     syntax: w 	which means: word
buffer code: #x9C #xF4 #xF4 #xCC
  file code: #xE2 #x82 #xAC (encoded by coding system mule-utf-8)
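
The "file code" values reported above are simply the UTF-8 encodings of the four characters. A small Python sketch to cross-check them (Python is an assumption here, not part of the original session; the helper name utf8_hex is hypothetical):

```python
# (character, "file code" as reported by C-u C-x = above)
SAMPLES = [
    ("a", "61"),
    ("\u0308", "CC 88"),     # combining diaeresis
    ("\u00df", "C3 9F"),     # sharp s
    ("\u20ac", "E2 82 AC"),  # euro sign
]

def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex octets."""
    return " ".join(f"{octet:02X}" for octet in text.encode("utf-8"))

for ch, reported in SAMPLES:
    # The last column says whether the computed octets match the report.
    print(f"U+{ord(ch):04X}", utf8_hex(ch), utf8_hex(ch) == reported)
```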

An excerpt from the fontset's description (ISO 8859-16 is missing!):

Fontset: -*-*-medium-r-*-*-10-*-*-*-m-*-fontset-startup
CHARSET or CHAR RANGE	FONT NAME
---------------------	---------
ascii			-b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1
      [-Adobe-Courier-Medium-R-Normal--10-100-75-75-M-60-ISO10646-1]
      [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
      [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-1		-b&h-lucidatypewriter-*-iso10646-1
      [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
      [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-2		-*-iso8859-2
latin-iso8859-3		-*-iso8859-3
latin-iso8859-4		-*-iso8859-4
thai-tis620		-*-*-*-tis620-*
greek-iso8859-7		-*-iso8859-7
arabic-iso8859-6	-*-iso8859-6
hebrew-iso8859-8	-*-iso8859-8
katakana-jisx0201	-*-jisx0201-*
latin-jisx0201		-*-jisx0201-*
cyrillic-iso8859-5	-*-iso8859-5
latin-iso8859-9		-*-iso8859-9
latin-iso8859-15	-*-iso8859-15
latin-iso8859-14	-*-iso8859-14
...
mule-unicode-2500-33ff	-b&h-lucidatypewriter-*-iso10646-1
mule-unicode-e000-ffff	-b&h-lucidatypewriter-*-iso10646-1
mule-unicode-0100-24ff	-b&h-lucidatypewriter-*-iso10646-1
      [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
      [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
...

IMO this display of UTF-8 characters is not good enough.


> If that doesn't work, I dunno, maybe it's something screwy about  
> the mac.
>

There is something special, possibly screwy, in the way Mac OS X (or
rather HFS+, the file system) stores UTF-8 characters in file names:
they get decomposed, i.e. an ä becomes a¨, an à becomes a`, etc. (Only
file names are affected; a file's contents are not decomposed. What
would such a JPEG picture look like?) So a precomposed character of two
or three octets is stored on disk as a pair: one octet for the base
letter and (mostly?) two octets for the combining mark. GNU Emacs should
be able to detect that: if a character is from the category (see above)
"Combining diacritic or mark", it cannot stand alone by nature, but must
be combined with the character on its left in a left-to-right writing
system, or with the character on its right in a right-to-left writing
system. (I have no idea of the rules in a top-to-bottom writing system
like Mongolian, or whether these have combining characters.) And it
should be able to handle the character categories correctly.
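
To illustrate the decomposition and the detection described above, here is a rough Python sketch. It approximates what HFS+ does with standard Unicode NFD, which is an assumption (Apple's file system uses its own variant of decomposition), and the helper names is_combining and compose_name are hypothetical:

```python
import unicodedata

def is_combining(ch):
    """True if ch has a nonzero canonical combining class,
    i.e. it is a mark that attaches to a base character."""
    return unicodedata.combining(ch) != 0

def compose_name(name):
    """Recombine base letters and combining marks (NFC) for display."""
    return unicodedata.normalize("NFC", name)

precomposed = "\u00e4"  # ä as a single code point
# Assumption: standard NFD stands in for the file system's decomposition.
decomposed = unicodedata.normalize("NFD", precomposed)

print([f"U+{ord(c):04X}" for c in decomposed])
print(len(precomposed.encode("utf-8")), "octets precomposed")
print(len(decomposed.encode("utf-8")), "octets decomposed")
print([is_combining(c) for c in decomposed])
print(compose_name(decomposed) == precomposed)
```

A program that lists file names could apply something like compose_name before display, so that the base letter and its mark are shown as one character.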

--
Greetings

   Pete

What's the difference between OS X and Vista?

Microsoft employees are excited about OS X…
