From: Peter Dyballa <Peter_Dyballa@Web.DE>
Cc: help-gnu-emacs@gnu.org
Subject: Re: UTF-8 in path / filename
Date: Sat, 26 Aug 2006 11:36:34 +0200
Message-ID: <0C15C504-B711-403E-B8D1-F03234C453E3@Web.DE>
In-Reply-To: <87odu8ct0a.fsf@catnip.gol.com>
On 26.08.2006 at 01:09, Miles Bader wrote:
> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>> There won't be a perfect solution with GNU Emacs in the near
>> future ...
>
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using
> emacs 22).
>
> [It works perfectly even if I do `emacs -Q' to avoid loading my init
> file, though I normally use (set-language-environment 'japanese).]
>
> AFAIK the main thing is that your LANG environment variable be set to
> something mentioning utf-8 -- I use "ja_JP.UTF-8".
>
pete 39 /\ .
/Users/pete
pete 40 /\ env | egrep -i 'LC|LANG'
LANG=de_DE.UTF-8
LC_CTYPE=de_DE.UTF-8
pete 41 /\ /usr/local/bin/emacs-22.0.50 -Q &
File names with UTF-8 characters in them are shown in dired (which has
-u: in the mode-line, i.e. it uses UTF-8) à la <vowel><empty box>. Some
UTF-8 characters like ß or € show up as themselves. They appear in the
same manner in the buffer's mode-line, once the file is visited, and
in the list of buffers (C-x b). They are completely unreadable in the
Buffers menu of the menu bar, and unreadable in yet another fashion in
the "Buffer Menu" pop-up. The font used for the vowels, the empty
boxes, and the other characters is taken from the Java SDK and is
quite rich (1425 mapped characters, mostly for European and some Near
Eastern scripts):
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x61)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x308)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#xDF)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x20AC)
Somehow this looks like a mixture of ISO 8859 characters (#x61, #xDF),
Unicode (#x20AC), and something else (#x308). Or are some of these
representations just abbreviated, leaving off the leading zeros?
The other information from C-u C-x = on the examples is:
character: a (97, #o141, #x61, U+0061)
charset: ascii (ASCII (ISO646 IRV))
code point: #x61
syntax: w which means: word
category: a:ASCII l:Latin
buffer code: #x61
file code: #x61 (encoded by coding system mule-utf-8)
character: (332488, #o1211310, #x512c8, U+0308)
charset: mule-unicode-0100-24ff (Unicode characters of the range
U+0100..U+24FF.)
code point: #x25 #x48
syntax: w which means: word
category: ^:Combining diacritic or mark
buffer code: #x9C #xF4 #xA5 #xC8
file code: #xCC #x88 (encoded by coding system mule-utf-8)
character: ß (2271, #o4337, #x8df, U+00DF)
charset: latin-iso8859-1 (Right-Hand Part of Latin Alphabet 1
(ISO/IEC 8859-1): ISO-IR-100.)
code point: #x5F
syntax: w which means: word
category: l:Latin
buffer code: #x81 #xDF
file code: #xC3 #x9F (encoded by coding system mule-utf-8)
character: € (342604, #o1235114, #x53a4c, U+20AC)
charset: mule-unicode-0100-24ff (Unicode characters of the range
U+0100..U+24FF.)
code point: #x74 #x4C
syntax: w which means: word
buffer code: #x9C #xF4 #xF4 #xCC
file code: #xE2 #x82 #xAC (encoded by coding system mule-utf-8)
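
The four examples can be cross-checked outside of Emacs. What follows is only a minimal Python sketch, assuming a Python 3 interpreter with the standard unicodedata module; the helper name describe is made up for illustration. It suggests that #x61, #x308, #xDF and #x20AC are all plain Unicode code points written without leading zeros, and that the "file code" octets reported above are simply their UTF-8 encodings.

```python
import unicodedata

# Code points copied from the font lines and the C-u C-x = output above.
CODE_POINTS = [0x61, 0x308, 0xDF, 0x20AC]


def describe(cp):
    """Return (U+XXXX label, Unicode name, UTF-8 octets) for one code point.

    Hypothetical helper, for illustration only.
    """
    ch = chr(cp)
    label = "U+%04X" % cp                      # pad with leading zeros
    name = unicodedata.name(ch)                # official Unicode character name
    octets = " ".join("#x%02X" % b for b in ch.encode("utf-8"))
    return label, name, octets


if __name__ == "__main__":
    for cp in CODE_POINTS:
        print("%-7s %-28s %s" % describe(cp))
```

For #x308 this reports U+0308, COMBINING DIAERESIS, with the octets #xCC #x88, the same two octets shown as the file code above.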
An excerpt from the fontset's description (I am missing ISO 8859-16!):
Fontset: -*-*-medium-r-*-*-10-*-*-*-m-*-fontset-startup
CHARSET or CHAR RANGE   FONT NAME
---------------------   ---------
ascii                   -b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1
    [-Adobe-Courier-Medium-R-Normal--10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-1         -b&h-lucidatypewriter-*-iso10646-1
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-2         -*-iso8859-2
latin-iso8859-3         -*-iso8859-3
latin-iso8859-4         -*-iso8859-4
thai-tis620             -*-*-*-tis620-*
greek-iso8859-7         -*-iso8859-7
arabic-iso8859-6        -*-iso8859-6
hebrew-iso8859-8        -*-iso8859-8
katakana-jisx0201       -*-jisx0201-*
latin-jisx0201          -*-jisx0201-*
cyrillic-iso8859-5      -*-iso8859-5
latin-iso8859-9         -*-iso8859-9
latin-iso8859-15        -*-iso8859-15
latin-iso8859-14        -*-iso8859-14
...
mule-unicode-2500-33ff  -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-e000-ffff  -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-0100-24ff  -b&h-lucidatypewriter-*-iso10646-1
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
...
IMO this display of UTF-8 characters is not adequate.
> If that doesn't work, I dunno, maybe it's something screwy about
> the mac.
>
There is something special, possibly screwy, in the way Mac OS X (or
better: HFS+, its file system) stores UTF-8 characters in file names:
they get decomposed, i.e. an ä becomes a¨, an à becomes a`, etc. (This
applies only to file names. A file's contents do not get decomposed;
what would a JPEG picture look like otherwise?) So a character of two
or three octets is expanded in the string on disk to a pair: one octet
plus (mostly?) two octets. GNU Emacs should be able to detect that: if
a character is from the category (see above) "Combining diacritic or
mark", it can't stand alone by nature, but must be combined with the
character on its left in a left-to-right writing system or with the
character on its right in a right-to-left writing system (I have no
idea of the rules in a top-to-bottom writing system like Mongolian, or
whether such systems have combining characters). And it should be able
to handle the character categories correctly.
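
The decomposition can be reproduced with Unicode normalization. The following is a minimal Python sketch, not a description of what Emacs or HFS+ actually does internally: it assumes that the form HFS+ stores is close to Unicode's canonical decomposition (NFD), and the helper names compose, decompose and has_combining_marks are hypothetical. It shows how a decomposed name can be detected via its combining marks and recombined for display.

```python
import unicodedata


def decompose(name):
    """Return the canonically decomposed (NFD) form of a file name."""
    return unicodedata.normalize("NFD", name)


def compose(name):
    """Return the precomposed (NFC) form, e.g. for display."""
    return unicodedata.normalize("NFC", name)


def has_combining_marks(name):
    """True if any character has a non-zero canonical combining class,
    i.e. cannot stand alone and attaches to a neighbouring base character."""
    return any(unicodedata.combining(ch) != 0 for ch in name)


if __name__ == "__main__":
    precomposed = "\u00e4"                 # ä as a single code point
    on_disk = decompose(precomposed)       # a followed by U+0308
    print([hex(ord(ch)) for ch in on_disk])
    # Octet counts in UTF-8 before and after decomposition.
    print(len(precomposed.encode("utf-8")), len(on_disk.encode("utf-8")))
    print(has_combining_marks(on_disk), compose(on_disk) == precomposed)
```

Under that assumption, a program listing a directory could apply compose to each name before showing it, while keeping the original string for any file system call.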
--
Greetings
Pete
What's the difference between OS X and Vista?
Microsoft employees are excited about OS X