* UTF-8 in path / filename
@ 2006-08-24 13:59 Grégory SCHMITT
2006-08-24 14:42 ` Noah Slater
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-24 13:59 UTC (permalink / raw)
Hi everyone,
I'm running emacs 21.4.1 using Linux (Fedora Core 5). When I try to open a
file and the path name contains UTF-8 letters, emacs won't be able to find
the file.
I create a folder called "Grégory". I put any file in it (let's call it
"test") and if I, from a simple xterm, try to do "emacs Grégory/test",
emacs won't be able to open the file. However, it will be successful if I
manually visit using C-x C-f.
If I use any other editor (such as mcedit), it will open OK.
Any explanation?
--
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>
* Re: UTF-8 in path / filename
2006-08-24 13:59 UTF-8 in path / filename Grégory SCHMITT
@ 2006-08-24 14:42 ` Noah Slater
2006-08-25 12:08 ` Peter Dyballa
[not found] ` <mailman.5606.1156507702.9609.help-gnu-emacs@gnu.org>
2 siblings, 0 replies; 17+ messages in thread
From: Noah Slater @ 2006-08-24 14:42 UTC (permalink / raw)
Cc: help-gnu-emacs
Grégory,
What is the command you are using? Perhaps xterm is configured
incorrectly and is mangling the file path before passing to Emacs.
What happens if you tab complete the file name in the shell?
Does the same happen with uxterm?
Thanks,
Noah
--
"Creativity can be a social contribution, but only in so
far as society is free to use the results." - R. Stallman
* Re: UTF-8 in path / filename
2006-08-24 13:59 UTF-8 in path / filename Grégory SCHMITT
2006-08-24 14:42 ` Noah Slater
@ 2006-08-25 12:08 ` Peter Dyballa
[not found] ` <mailman.5606.1156507702.9609.help-gnu-emacs@gnu.org>
2 siblings, 0 replies; 17+ messages in thread
From: Peter Dyballa @ 2006-08-25 12:08 UTC (permalink / raw)
Cc: help-gnu-emacs
Am 24.08.2006 um 15:59 schrieb Grégory SCHMITT:
> Hi everyone,
>
> I'm running emacs 21.4.1 using Linux (Fedora Core 5). When I try to
> open a
> file and the path name contains UTF-8 letters, emacs won't be able
> to find
> the file.
>
> I create a folder called "Grégory". I put any file in it (let's
> call it
> "test") and if I, from a simple xterm, try to do "emacs Grégory/test",
> emacs won't be able to open the file. However, it will be
> successful if I
> manually visit using C-x C-f.
>
> If I use any other editor (such as mcedit), it will open OK.
>
> Any explanation ?
>
Yes: your terminal emulation/shell swallows/hides information.
On Mac OS X in Apple's Terminal (TERM is xterm-color) I can see UTF-8
filenames, for example äöüßÜÖÄ€. File name expansion/completion
does *not* work on them (although RGB äöüæÆÜÖÄ.txt gets
expanded to RGB a?^?o?^?u?^?æ?^?U?^?O?^?A?^?.txt). And of course it
does not work to invoke GNU Emacs with this file name as argument (nor
the 'built-in' vi or nano). It *works*, though, when I do that from the
*shell* buffer in Unicode Emacs 23.0.0 or GNU Emacs 22.0.50 (although
without file name completion, and with the latter showing the ¨ as
empty boxes in the file name). If, for example, I paste a name with
UTF-8 contents from ls output to pass it to vi (it gives the best
complaints), I can see that the decomposed UTF-8 characters are
strangely interpreted. An ä seems to vanish and become a kind of
control character; the ¨ component of A¨, i.e. Ä, is passed as <cc>
or such ...
Since in your case mcedit accepts the file name, mcedit and your
terminal seem to use the same character encoding, so for both é *is*
an é. GNU Emacs lives in its own world of almost countless character
encodings. One way to make Emacs work correctly is to set environment
variables like LC_ALL, LANG, or LC_CTYPE, which obviously just repeat
what your shell and your OS's standard utilities know. The next is *not*
to set current-language-environment! From LC_CTYPE etc. Emacs learns
which encodings to use for buffer contents, file names, and process data.
If it gets this wrong, you might consider using
(prefer-coding-system 'iso-latin-9-unix) ; the one with €
or a few such statements with different codings each. GNU Emacs will
then try to apply these encodings first. Since you're working with a
non-Unicode Emacs you might need to set
(unify-8859-on-decoding-mode t)
(unify-8859-on-encoding-mode t)
to make the 8-bit ISO Latin encodings be treated as equivalent, i.e. é
would be the same character in every one of these encodings in which it
exists, so you could search for it in all buffers after telling isearch
only once to look for é.
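The way a program derives its character encoding from these variables can be sketched as follows. This is only an illustration of the usual POSIX precedence (LC_ALL over LC_CTYPE over LANG), not a description of how Emacs implements it; the helper names `effective_ctype` and `codeset` are made up for this example.

```python
import os


def effective_ctype(env):
    """Return the locale name that governs character encoding, or None.

    Usual POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides
    LANG. Empty values count as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


def codeset(locale_name):
    """Extract the codeset from a name such as 'fr_FR.UTF-8@euro'."""
    if locale_name is None or "." not in locale_name:
        return None
    return locale_name.split(".", 1)[1].split("@", 1)[0]


if __name__ == "__main__":
    # Report what the current environment asks for.
    print(codeset(effective_ctype(os.environ)))
```

With only LANG="fr_FR.UTF-8" set, the sketch reports UTF-8; setting LC_ALL to something else would take precedence over it.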
One important thing is that *you* already messed up your .emacs file.
Try to launch it also with --no-init-file and/or --no-site-file and
also with -nw, i.e. running inside the terminal without X windows.
--
Greetings
Pete
The most exciting phrase to hear in science, the one that heralds new
discoveries, is not "Eureka!" (I found it!) but "That's funny..."
Isaac Asimov
* Re: UTF-8 in path / filename
[not found] ` <mailman.5606.1156507702.9609.help-gnu-emacs@gnu.org>
@ 2006-08-25 13:42 ` Grégory SCHMITT
2006-08-25 18:35 ` Peter Dyballa
0 siblings, 1 reply; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-25 13:42 UTC (permalink / raw)
Le Fri, 25 Aug 2006 14:08:11 +0200, Peter Dyballa a écrit :
> One important thing is that *you* already messed up your .emacs file. Try
> to launch it also with --no-init-file and/or --no-site-file and also with
> -nw, i.e. running inside the terminal without X windows.
OK. I did it. I moved my .emacs to another place, even though I never
really modified it. Still no success. For info, my locale is set as
LANG="fr_FR.UTF-8" (and that's all: no LC_CTYPE or other variables). My
terminal is an xterm (like yours); I also tried from the console, with
bash only, and got the same result.
If I set emacs to run in unibyte mode (with --unibyte on the command
line), it does work, but the file content (which is UTF-8) is parsed as
8859-15.
--
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>
* Re: UTF-8 in path / filename
2006-08-25 13:42 ` Grégory SCHMITT
@ 2006-08-25 18:35 ` Peter Dyballa
2006-08-25 22:06 ` Grégory SCHMITT
0 siblings, 1 reply; 17+ messages in thread
From: Peter Dyballa @ 2006-08-25 18:35 UTC (permalink / raw)
Cc: help-gnu-emacs
Am 25.08.2006 um 15:42 schrieb Grégory SCHMITT:
> If I set emacs to run in unibyte mode (with --unibyte on the command
> line), it does work, but the file content (which is UTF-8) is
> parsed as
> 8859-15.
>
This looks as if your system does not use UTF-8 ...
Can you create a file with accented characters? If not, can you put a
copy of the file from the Grégory directory into your home or some
other directory and invoke emacs, with and without --unibyte, on both
files? In the first case the accented name would appear in the
mode-line of the buffer (and you would see what was passed or received
as argument); in the latter case GNU Emacs would put the directory's
name in the mode-line, I hope, to distinguish the two files with the
same name. Again, you would see what was passed or received as
"Grégory" ...
Depending on whether the file names are UTF-8 or not, you can declare
this in .emacs with one of:
(setq default-file-name-coding-system 'utf-8)
(setq default-file-name-coding-system 'iso-8859-15)
There are a lot more *coding-systems you can set ...
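Why the file-name coding system matters can be seen from the raw octets. The following is a small sketch in Python (not Emacs Lisp), assuming the directory name uses a precomposed é; the helper `decode_or_none` is hypothetical and exists only for this illustration.

```python
name = "Gr\u00e9gory"  # "Grégory" with a precomposed é

utf8_bytes = name.encode("utf-8")          # é becomes two octets: c3 a9
latin9_bytes = name.encode("iso-8859-15")  # é becomes one octet: e9


def decode_or_none(raw, coding):
    """Decode raw file-name octets, or return None if they do not fit."""
    try:
        return raw.decode(coding)
    except UnicodeDecodeError:
        return None


print(utf8_bytes, latin9_bytes)
# ISO-8859-15 octets are not valid UTF-8, so this lookup cannot succeed.
print(decode_or_none(latin9_bytes, "utf-8"))
# UTF-8 octets always decode as ISO-8859-15, but to the wrong characters.
print(decode_or_none(utf8_bytes, "iso-8859-15"))
```

A program that encodes the name with one coding while the file system holds the other asks for a sequence of octets that does not exist on disk, which would be consistent with a "file not found" result.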
--
Greetings
Pete
<\
\__ O __O
| O\ _\\/\-% _`\<,
'()-'-(_)--(_) (_)/(_)
* Re: UTF-8 in path / filename
2006-08-25 18:35 ` Peter Dyballa
@ 2006-08-25 22:06 ` Grégory SCHMITT
2006-08-25 22:55 ` Peter Dyballa
[not found] ` <mailman.5656.1156546542.9609.help-gnu-emacs@gnu.org>
0 siblings, 2 replies; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-25 22:06 UTC (permalink / raw)
Cc: help-gnu-emacs
> ----- Original Message -----
> Date: Fri, 25 Aug 2006 20:35:08 +0200
> From: Peter Dyballa <Peter_Dyballa@Web.DE>
> To: Grégory SCHMITT <gregory.schmitt@free.fr>
> Cc: help-gnu-emacs@gnu.org
> Subject: Re: UTF-8 in path / filename
>
> Am 25.08.2006 um 15:42 schrieb Grégory SCHMITT:
>
> > If I set emacs to run in unibyte mode (with --unibyte on the command
> > line), it does work, but the file content (which is UTF-8) is
> > parsed as
> > 8859-15.
> >
>
> This looks as if your system does not use UTF-8 ...
I thought Fedora uses UTF-8 by default.
> Can you create a file with accented characters? If not, can you put a
> copy of the file in the Grégory directory into your home or some
> other directory and invoke emacs, with or with no unibytes, with both
> files? In the first case the accented name would appear in the mode-
> line of the buffer (and would see what was passed or received as
> argument), in the latter case GNU Emacs would put the directory's
> name in the mode-line, I hope, to distinguish the two files with the
> same name. Again, you would see what was passed or received as
> "Grégory" ...
>
> If the file names are or are not UTF-8, you can declare this
> in .emacs with:
>
> (setq default-file-name-coding-system 'utf-8)
> (setq default-file-name-coding-system 'iso-8859-15)
OK. So I have two folders, "Greg" and "Grégory", in my home (ext3
filesystem, default options). I now have two files, "test" and "testé",
in each of them, plus in the current directory. Those files have the
same UTF-8 content, so I'm able to tell if they're parsed correctly or
not.
First case, with multibyte:
- both files in the "Greg" folder are visited correctly: file is
opened, content looks ok. However, the buffer name for "testé" appears
as "testÀ" (or something like that), which in my mind is proof that the
file name is actually UTF-8 and displayed like ISO.
- both files in the "Grégory" folder are not visited. Manually visiting
the files works fine however, and the buffer name is correct ("testé" is
spelled correctly). File content is ok as well.
- both files in the current directory are visited ok, content is ok,
buffer name NOT ok.
Second, with unibyte:
- "Greg" folder: files visited ok, content NOT ok, buffer name NOT ok.
- "Grégory" folder: files visited ok, content NOT ok, buffer name NOT
ok.
- both files in the current directory are visited ok, content NOT ok,
buffer name NOT ok.
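The "buffer name NOT ok" symptom matches what happens when UTF-8 octets are displayed as ISO-8859-15. A minimal sketch in Python, assuming the name "testé" is stored with a precomposed é:

```python
name = "test\u00e9"  # "testé"

raw = name.encode("utf-8")           # octets as stored on disk
misread = raw.decode("iso-8859-15")  # same octets read as ISO-8859-15

print(misread)  # testÃ©

# The damage is only in the interpretation: reversing the wrong step
# recovers the original name.
restored = misread.encode("iso-8859-15").decode("utf-8")
print(restored == name)  # True
```

The é turns into two characters because its UTF-8 form is two octets, each of which is a valid ISO-8859-15 character on its own.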
Hope that helps. As for me, I'm stuck...
--
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>
* Re: UTF-8 in path / filename
2006-08-25 22:06 ` Grégory SCHMITT
@ 2006-08-25 22:55 ` Peter Dyballa
[not found] ` <mailman.5656.1156546542.9609.help-gnu-emacs@gnu.org>
1 sibling, 0 replies; 17+ messages in thread
From: Peter Dyballa @ 2006-08-25 22:55 UTC (permalink / raw)
Cc: help-gnu-emacs
Am 26.08.2006 um 00:06 schrieb Grégory SCHMITT:
> Hope that helps. As for me, I'm stuck...
I feel the same! No GNU Emacs is meant to handle UTF-8 the way
other applications do. Unicode Emacs 23.0.0 behaves a bit better.
What you could try is to set default-buffer-file-coding-system to
utf-8. It could also be that some entry in file-coding-system-alist
does not let you see UTF-8 contents, so check its value. I have
in my customisation section
'(unibyte-display-via-language-environment t)
and avoid set-language-environment.
There won't be a perfect solution with GNU Emacs in the near future ...
Is your Emacs copy installed from an RPM package or did you configure
and compile yourself? For me UTF-8 and Emacs are too important to get
it from somewhere, so I compile myself.
--
Greetings
Pete
»¿ʇı̣ əsnqɐ ʇ,uɐɔ noʎ ɟı̣
ɓuı̣ɥʇʎuɐ sı̣ pooɓ ʇɐɥʍ«
* Re: UTF-8 in path / filename
[not found] ` <mailman.5656.1156546542.9609.help-gnu-emacs@gnu.org>
@ 2006-08-25 23:06 ` Grégory SCHMITT
2006-08-25 23:09 ` Miles Bader
[not found] ` <mailman.5657.1156547377.9609.help-gnu-emacs@gnu.org>
2 siblings, 0 replies; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-25 23:06 UTC (permalink / raw)
Le Sat, 26 Aug 2006 00:55:31 +0200, Peter Dyballa a écrit :
>
> Am 26.08.2006 um 00:06 schrieb Grégory SCHMITT:
>
> There won't be a perfect solution with GNU Emacs in the near future ...
>
>
> Is your Emacs copy installed from an RPM package or did you configure and
> compile yourself? For me UTF-8 and Emacs are too important to get it from
> somewhere, so I compile myself.
>
It's the standard RPM from Fedora. I will take a look at other versions of
emacs.
--
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>
* Re: UTF-8 in path / filename
[not found] ` <mailman.5656.1156546542.9609.help-gnu-emacs@gnu.org>
2006-08-25 23:06 ` Grégory SCHMITT
@ 2006-08-25 23:09 ` Miles Bader
2006-08-26 9:36 ` Peter Dyballa
[not found] ` <mailman.5657.1156547377.9609.help-gnu-emacs@gnu.org>
2 siblings, 1 reply; 17+ messages in thread
From: Miles Bader @ 2006-08-25 23:09 UTC (permalink / raw)
Peter Dyballa <Peter_Dyballa@Web.DE> writes:
> There won't be a perfect solution with GNU Emacs in the near future ...
You constantly seem to be having problems with UTF-8, but it works
absolutely perfectly for me, filenames, dired, everything (using emacs 22).
[It works perfectly even if I do `emacs -Q' to avoid loading my init
file, though I normally use (set-language-environment 'japanese).]
AFAIK the main thing is that your LANG environment variable be set to
something mentioning utf-8 -- I use "ja_JP.UTF-8".
If that doesn't work, I dunno, maybe it's something screwy about the mac.
-Miles
--
.Numeric stability is probably not all that important when you're guessing.
* Re: UTF-8 in path / filename
[not found] ` <mailman.5657.1156547377.9609.help-gnu-emacs@gnu.org>
@ 2006-08-25 23:22 ` Grégory SCHMITT
2006-08-25 23:25 ` Miles Bader
0 siblings, 1 reply; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-25 23:22 UTC (permalink / raw)
Le Sat, 26 Aug 2006 08:09:25 +0900, Miles Bader a écrit :
> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>> There won't be a perfect solution with GNU Emacs in the near future ...
>
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using emacs
> 22).
Emacs 22 is said to be a MAJOR improvement for UTF-8. Wish I could get my
hands on a package release soon...
--
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>
* Re: UTF-8 in path / filename
2006-08-25 23:22 ` Grégory SCHMITT
@ 2006-08-25 23:25 ` Miles Bader
0 siblings, 0 replies; 17+ messages in thread
From: Miles Bader @ 2006-08-25 23:25 UTC (permalink / raw)
Grégory SCHMITT <gregory.schmitt@free.fr> writes:
> Emacs 22 is said to be a MAJOR improvement for Utf. Wish I could get my
> hands on a package release soon...
I'm sure there must be somebody out there maintaining RPMs for the
development version (in debian you can use the "emacs-snapshot" package;
there are also nicely packaged windows binaries out there).
-Miles
--
"Most attacks seem to take place at night, during a rainstorm, uphill,
where four map sheets join." -- Anon. British Officer in WW I
* Re: UTF-8 in path / filename
2006-08-25 23:09 ` Miles Bader
@ 2006-08-26 9:36 ` Peter Dyballa
2006-08-26 22:13 ` James Cloos
[not found] ` <mailman.5694.1156630455.9609.help-gnu-emacs@gnu.org>
0 siblings, 2 replies; 17+ messages in thread
From: Peter Dyballa @ 2006-08-26 9:36 UTC (permalink / raw)
Cc: help-gnu-emacs
Am 26.08.2006 um 01:09 schrieb Miles Bader:
> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>> There won't be a perfect solution with GNU Emacs in the near
>> future ...
>
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using
> emacs 22).
>
> [It works perfectly even if I do `emacs -Q' to avoid loading my init
> file, though I normally use (set-language-environment 'japanese).]
>
> AFAIK the main thing is that your LANG environment variable be set to
> something mentioning utf-8 -- I use "ja_JP.UTF-8".
>
pete 39 /\ .
/Users/pete
pete 40 /\ env | egrep -i 'LC|LANG'
LANG=de_DE.UTF-8
LC_CTYPE=de_DE.UTF-8
pete 41 /\ /usr/local/bin/emacs-22.0.50 -Q &
Files with UTF-8 characters in their names are shown in dired (which
has -u: in the mode-line, i.e. uses UTF-8) à la <vowel><empty box>.
Some UTF-8 characters like ß or € show up as themselves. They appear in
the same manner in the buffer's mode-line once visited, and also in the
buffer list (C-x b); they are completely unreadable in the Buffers menu
of the menu bar, and unreadable in yet another fashion in the
"Buffer Menu" pop-up. The font used for the vowels, the empty boxes,
and the other characters is taken from the Java SDK and is quite rich
(1425 mapped characters for mostly European and some Near Eastern
scripts):
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x61)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x308)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#xDF)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x20AC)
Somehow this looks like a mixture of ISO 8859 characters (#x61, #xDF),
Unicode (#x20AC), and something else (#x308). Or are some of these
representations just abbreviations that leave off the leading zeros?
The other information from C-u C-x = on the examples is:
character: a (97, #o141, #x61, U+0061)
charset: ascii (ASCII (ISO646 IRV))
code point: #x61
syntax: w which means: word
category: a:ASCII l:Latin
buffer code: #x61
file code: #x61 (encoded by coding system mule-utf-8)
character: (332488, #o1211310, #x512c8, U+0308)
charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
code point: #x25 #x48
syntax: w which means: word
category: ^:Combining diacritic or mark
buffer code: #x9C #xF4 #xA5 #xC8
file code: #xCC #x88 (encoded by coding system mule-utf-8)
character: ß (2271, #o4337, #x8df, U+00DF)
charset: latin-iso8859-1 (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
code point: #x5F
syntax: w which means: word
category: l:Latin
buffer code: #x81 #xDF
file code: #xC3 #x9F (encoded by coding system mule-utf-8)
character: € (342604, #o1235114, #x53a4c, U+20AC)
charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
code point: #x74 #x4C
syntax: w which means: word
buffer code: #x9C #xF4 #xF4 #xCC
file code: #xE2 #x82 #xAC (encoded by coding system mule-utf-8)
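One plausible reading of the glyph codes is that all four are plain Unicode code points printed without leading zeros, which agrees with the U+ values in the C-u C-x = output. The following Python sketch (an illustration only; `utf8_octets` is a helper invented here) prints each code point with its name and UTF-8 octets so they can be compared with the "file code" lines above.

```python
import unicodedata


def utf8_octets(code_point):
    """Return the UTF-8 octets of a code point as '#xNN' strings."""
    return ["#x%02X" % b for b in chr(code_point).encode("utf-8")]


if __name__ == "__main__":
    # The four glyph codes reported for the font.
    for cp in (0x61, 0x308, 0xDF, 0x20AC):
        name = unicodedata.name(chr(cp))
        print("U+%04X %s: %s" % (cp, name, " ".join(utf8_octets(cp))))
```

Under this reading #x308 is not "something else" but U+0308 COMBINING DIAERESIS, the second half of a decomposed ä, ö, or ü.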
An excerpt from the fontset's description (I am missing ISO 8859-16!):
Fontset: -*-*-medium-r-*-*-10-*-*-*-m-*-fontset-startup
CHARSET or CHAR RANGE    FONT NAME
---------------------    ---------
ascii                    -b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1
    [-Adobe-Courier-Medium-R-Normal--10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-1          -b&h-lucidatypewriter-*-iso10646-1
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-2          -*-iso8859-2
latin-iso8859-3          -*-iso8859-3
latin-iso8859-4          -*-iso8859-4
thai-tis620              -*-*-*-tis620-*
greek-iso8859-7          -*-iso8859-7
arabic-iso8859-6         -*-iso8859-6
hebrew-iso8859-8         -*-iso8859-8
katakana-jisx0201        -*-jisx0201-*
latin-jisx0201           -*-jisx0201-*
cyrillic-iso8859-5       -*-iso8859-5
latin-iso8859-9          -*-iso8859-9
latin-iso8859-15         -*-iso8859-15
latin-iso8859-14         -*-iso8859-14
...
mule-unicode-2500-33ff   -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-e000-ffff   -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-0100-24ff   -b&h-lucidatypewriter-*-iso10646-1
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
...
IMO the display of UTF-8 characters is not sufficient.
> If that doesn't work, I dunno, maybe it's something screwy about
> the mac.
>
There is something special, possibly screwy, in the way Mac OS X (or
better: HFS+, its file system) stores UTF-8 characters in file names:
they get decomposed, i.e. an ä becomes a¨, an à becomes a`, etc. (Only
file names are affected; a file's contents do not get decomposed.
What would such a JPEG picture look like?) So the two or three octets
of a precomposed character are stored on disk as one octet for the
base letter plus (mostly?) two octets for the combining mark. GNU
Emacs should be able to detect that: if a character is from the
category (see above) "Combining diacritic or mark", it can't stand
alone by nature, but must be combined with the character on the left
in a left-to-right writing system or with the character on the right
in a right-to-left writing system (I have no idea of the rules in a
top-to-bottom writing system like Mongolian, or whether these have
combining characters). And it should be able to handle the character
categories correctly.
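The difference between the two storage forms can be made concrete with Python's standard unicodedata module. This is a sketch of the Unicode normalization forms in general, not of what HFS+ does internally; the helper `octets` is invented for the example.

```python
import unicodedata


def octets(text):
    """Return the UTF-8 octets of text as a hex string."""
    return " ".join("%02x" % b for b in text.encode("utf-8"))


nfc = unicodedata.normalize("NFC", "\u00e4")  # precomposed ä
nfd = unicodedata.normalize("NFD", nfc)       # a + COMBINING DIAERESIS

print(octets(nfc))  # c3 a4
print(octets(nfd))  # 61 cc 88

# Characters without a canonical decomposition are unchanged by NFD.
print(unicodedata.normalize("NFD", "\u00df\u20ac") == "\u00df\u20ac")  # True
```

The last line may explain why ß and € show up as themselves in dired while the umlauts turn into a vowel followed by an empty box.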
--
Greetings
Pete
What's the difference between OS X and Vista?
Microsoft employees are excited about OS X
* Re: UTF-8 in path / filename
2006-08-26 9:36 ` Peter Dyballa
@ 2006-08-26 22:13 ` James Cloos
2006-08-27 13:12 ` Peter Dyballa
[not found] ` <mailman.5694.1156630455.9609.help-gnu-emacs@gnu.org>
1 sibling, 1 reply; 17+ messages in thread
From: James Cloos @ 2006-08-26 22:13 UTC (permalink / raw)
Cc: help-gnu-emacs, Miles Bader
>>>>> "Peter" == Peter Dyballa <Peter_Dyballa@Web.DE> writes:
Peter> Files with UTF-8 characters in them are shown in dired (has -u: in
Peter> mode-line, i.e. uses UTF-8) à la <vowel><empty box>. Some UTF-8
Peter> characters like ß or € show up as themselves.
Doesn't Apple by default use NFD (Normalization Form Decomposed) for
filenames? That would explain the <vowel><box> sequences.
I suspect most others end up with NFC filenames. And composition
seems much better in the emacs-unicode-2 branch than in HEAD.
(But still not perfect. I sometimes get bad metrics on composed
glyphs; and sometimes they display as intended....)
Can you get at the actual octet-sequence of the filenames?
-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 0xED7DAEA6
* Re: UTF-8 in path / filename
[not found] ` <mailman.5694.1156630455.9609.help-gnu-emacs@gnu.org>
@ 2006-08-27 8:46 ` Harald Hanche-Olsen
0 siblings, 0 replies; 17+ messages in thread
From: Harald Hanche-Olsen @ 2006-08-27 8:46 UTC (permalink / raw)
+ James Cloos <cloos@jhcloos.com>:
| Doesn't Apple by default use NFD (Normalization Form Decomposed) for
| filenames?
Seems you're right. See below.
| Can you get at the actual octet-sequence of the filenames?
I just now used TextEdit to create a text file with the filename
xxx-é-ï-ē-ĭ-ǫḥ.txt
(the xxx- prefix only so I could access it using wildcards in my shell)
; echo xxx-*.txt | od -t a
0000000 x x x - e cc 81 - i cc 88 - e cc 84 -
0000020 i cc 86 - o cc a8 h cc a3 . t x t nl
0000037
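The dump can be read back to check that it really is NFD. Below is a sketch in Python with the octets transcribed by hand from the od output above (so any transcription slip is mine); it decodes them, lists the combining marks, and recomposes the name.

```python
import unicodedata

# Octets transcribed from the od dump: 31 bytes including the newline.
raw = bytes.fromhex(
    "7878782d" "65cc81" "2d" "69cc88" "2d" "65cc84" "2d"
    "69cc86" "2d" "6fcca8" "68cca3" "2e747874" "0a"
)

decoded = raw.decode("utf-8").rstrip("\n")

# Every cc xx pair is a combining mark attached to the preceding letter.
marks = [ch for ch in decoded if unicodedata.combining(ch)]
for ch in marks:
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# Recomposing yields the name as it was typed.
recomposed = unicodedata.normalize("NFC", decoded)
print(recomposed)
```

The byte count matches the final offset 0000037 (octal for 31), and the decoded string is already in NFD, which supports the observation that the file system stores decomposed names.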
--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell
* Re: UTF-8 in path / filename
2006-08-26 22:13 ` James Cloos
@ 2006-08-27 13:12 ` Peter Dyballa
2006-08-28 15:11 ` James Cloos
0 siblings, 1 reply; 17+ messages in thread
From: Peter Dyballa @ 2006-08-27 13:12 UTC (permalink / raw)
Cc: help-gnu-emacs, Miles Bader
Am 27.08.2006 um 00:13 schrieb James Cloos:
> Peter> Files with UTF-8 characters in them are shown in dired
> Peter> (has -u: in mode-line, i.e. uses UTF-8) à la <vowel><empty
> Peter> box>. Some UTF-8 characters like ß or € show up as themselves.
>
> Doesn't Apple by default use NFD (Normalization Form Decomposed) for
> filenames? That would explain the <vowel><box> sequences.
Yes, that's the correct term for the way file names are recorded in
HFS+.
The font file, LucidaTypewriterRegular.ttf, has no combining
diacritical marks defined (only some modifiers), so these empty boxes
are displayed instead.
>
> Can you get at the actual octet-sequence of the filenames?
Do you know a tool that can do that? I can only think of a C
programme that reads the inode and then outputs the octets. Doing the
same as Harald did, I get different output in Terminal (because UTF-8
characters are substituted with question marks), for example:
pete 140 /\ l -1 | grep .txt | grep ' ' | grep -v Mac
RGB äöüæÆÜÖÄ.txt
pete 141 /\ l -1 | grep .txt | grep ' ' | grep -v Mac | od -t a
R G B sp a ? 88 o ? 88 u ? 88 ? ? ?
86 U ? 88 O ? 88 A ? 88 . t x t nl
In Emacsen' shells I get:
R G B sp a \314 88 o \314 88 u \314 88
\303 \246 \303
86 U \314 88 O \314 88 A \314 88 . t x t nl
The file name áÛïǓà.txt is interpreted as:
a \314 81 U \314 82 i \314 88 U \314 8c a
\314 80 .
t x t nl
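One way to get at the octets without going through a terminal or shell is to ask for the directory entries as raw bytes. A minimal sketch in Python, assuming a Python 3 interpreter is available; `raw_names` and `hex_octets` are names made up for this example.

```python
import os


def raw_names(directory):
    """Return the directory's entries as undecoded byte strings, sorted."""
    # A bytes path makes os.listdir return bytes, so no decoding happens.
    return sorted(os.listdir(os.fsencode(directory)))


def hex_octets(raw):
    """Format a byte string as space-separated hex octets."""
    return " ".join("%02x" % b for b in raw)


if __name__ == "__main__":
    for entry in raw_names("."):
        print(hex_octets(entry))
```

Because the names never pass through the terminal's character handling, nothing can be replaced by question marks on the way.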
--
Greetings
Pete
"Isn't vi that text editor with two modes... one that beeps and one
that corrupts your file?" -- Dan Jacobson, on comp.os.linux.advocacy
* Re: UTF-8 in path / filename
2006-08-27 13:12 ` Peter Dyballa
@ 2006-08-28 15:11 ` James Cloos
2006-08-28 15:55 ` Peter Dyballa
0 siblings, 1 reply; 17+ messages in thread
From: James Cloos @ 2006-08-28 15:11 UTC (permalink / raw)
Cc: help-gnu-emacs, Miles Bader
JimC> Doesn't Apple by default use NFD (Normalization Form Decomposed)
JimC> for filenames? That would explain the <vowel><box> sequences.
Peter> Yes, that's the correct term for the way file names are
Peter> recorded in HFS+.
So then the problem is narrowed to support for composition.
I just gave it a test, running the unicode-2 branch on a linux box,
using the en_US.UTF-8 locale.
I copied the filename you quoted (äöüæÆÜÖÄ.txt), gave it a prefix to
ease globbing (resulting in /tmp/xxx-äöüæÆÜÖÄ.txt), and ran find-file
on /tmp. It worked correctly. (Well, almost; the glyphs composed by
emacs have twice the height of pre-composed glyphs. There was a time
when emacs didn't do that, but it is doing it again. Including in
this buffer. But that looks to be specific to --enable-font-backend
and DejaVu Sans Mono. With other fonts I do not get visible accents,
even though C-u C-x = claims it is composing. And without --e-f-b I
get composed glyphs which have correct vertical metrics.)
I also tested this:
:; echo /tmp/xxx-a*
and got the filename, showing that bash treats the code points as
separate characters when globbing. (Which also means I didn't
actually need the xxx- prefix, since a* will therefore match the
original filename....)
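The globbing result can be imitated with Python's fnmatch module. This is only an analogy, since fnmatch is not bash, but it shows the same per-code-point matching on the decomposed and precomposed forms of the file name.

```python
import fnmatch
import unicodedata

# "äöüæÆÜÖÄ.txt" written with precomposed characters (NFC).
name_nfc = "\u00e4\u00f6\u00fc\u00e6\u00c6\u00dc\u00d6\u00c4.txt"
# The same name as a decomposing file system would store it (NFD).
name_nfd = unicodedata.normalize("NFD", name_nfc)

# In NFD the name starts with a plain "a" followed by U+0308.
print(fnmatch.fnmatchcase(name_nfd, "a*"))  # True
# In NFC the first character is U+00E4, which is not "a".
print(fnmatch.fnmatchcase(name_nfc, "a*"))  # False
```

So a pattern that matches a name on a decomposing file system may not match the visually identical name stored precomposed elsewhere.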
So. Does C-u C-x = claim to be composing for you?
-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 0xED7DAEA6
* Re: UTF-8 in path / filename
2006-08-28 15:11 ` James Cloos
@ 2006-08-28 15:55 ` Peter Dyballa
0 siblings, 0 replies; 17+ messages in thread
From: Peter Dyballa @ 2006-08-28 15:55 UTC (permalink / raw)
Cc: help-gnu-emacs, Miles Bader
Am 28.08.2006 um 17:11 schrieb James Cloos:
> So. Does C-u C-x = claim to be composing for you?
Yes, in GNU Emacs 23:
character: U (85, #o125, #x55)
preferred charset: ascii (ASCII (ISO646 IRV))
code point: 0x55
syntax: w which means: word
category: a:ASCII l:Latin r:Japanese roman
buffer code: #x55
file code: not encodable by coding system utf-8-unix
display: composed to form "Ü" (see below)
Unicode data:
Name: LATIN CAPITAL LETTER U
Category: Letter, Uppercase
Combining class: Lu
Bidi category: Lu
Lowercase: u
Composed with the following character(s) "¨" by the rule:
(?U (tc . bc) ?¨)
The component character(s) are displayed by these fonts (glyph codes):
U: -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO8859-1 (#x55)
¨: -MUTT-ClearlyU-Medium-R-Normal--17-120-100-100-P-123-ISO10646-1 (#x308)
(Here you can see the reason for the overly tall composed
characters: a much too big font.)
In GNU Emacs 22.0.50 they are not composed, they are <vowel><accent>.
Instead of composing a character I would first try to find the
precomposed form in the font(set) used. It surely would look much better.
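The lookup suggested here, trying the precomposed form before falling back to composition, could be sketched like this in Python. It says nothing about how Emacs selects glyphs; `precomposed_or_none` is a hypothetical helper that only shows Unicode NFC doing the mapping.

```python
import unicodedata


def precomposed_or_none(base, mark):
    """Return the single precomposed character for base+mark, if any."""
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None


print(precomposed_or_none("U", "\u0308"))  # Ü
# No precomposed "q with diaeresis" exists, so composition is still needed.
print(precomposed_or_none("q", "\u0308"))  # None
```

A display engine taking this route would still need the composing fallback for pairs that have no precomposed code point, and the font would have to contain the precomposed glyph.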
--
Greetings
Pete
"We have to expect it, otherwise we would be surprised."