all messages for Emacs-related lists mirrored at yhetil.org
* UTF-8 in path / filename
@ 2006-08-24 13:59 Grégory SCHMITT
  2006-08-24 14:42 ` Noah Slater
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-24 13:59 UTC (permalink / raw)


Hi everyone,

I'm running emacs 21.4.1 using Linux (Fedora Core 5). When I try to open a
file and the path name contains UTF-8 letters, emacs won't be able to find
the file.

I create a folder called "Grégory". I put any file in it (let's call it
"test") and if I, from a simple xterm, try to do "emacs Grégory/test",
emacs won't be able to open the file. However, it will be successful if I
manually visit using C-x C-f.

If I use any other editor (such as mcedit), it will open OK.

Any explanation?


-- 
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: UTF-8 in path / filename
  2006-08-24 13:59 UTF-8 in path / filename Grégory SCHMITT
@ 2006-08-24 14:42 ` Noah Slater
  2006-08-25 12:08 ` Peter Dyballa
       [not found] ` <mailman.5606.1156507702.9609.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 17+ messages in thread
From: Noah Slater @ 2006-08-24 14:42 UTC (permalink / raw)
  Cc: help-gnu-emacs


Grégory,

What is the command you are using? Perhaps xterm is configured
incorrectly and is mangling the file path before passing to Emacs.

What happens if you tab complete the file name in the shell?

Does the same happen with uxterm?

Thanks,
Noah


-- 
"Creativity can be a social contribution, but only in so
far as society is free to use the results." - R. Stallman



* Re: UTF-8 in path / filename
  2006-08-24 13:59 UTF-8 in path / filename Grégory SCHMITT
  2006-08-24 14:42 ` Noah Slater
@ 2006-08-25 12:08 ` Peter Dyballa
       [not found] ` <mailman.5606.1156507702.9609.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 17+ messages in thread
From: Peter Dyballa @ 2006-08-25 12:08 UTC (permalink / raw)
  Cc: help-gnu-emacs


Am 24.08.2006 um 15:59 schrieb Grégory SCHMITT:

> Hi everyone,
>
> I'm running emacs 21.4.1 using Linux (Fedora Core 5). When I try to open a
> file and the path name contains UTF-8 letters, emacs won't be able to find
> the file.
>
> I create a folder called "Grégory". I put any file in it (let's call it
> "test") and if I, from a simple xterm, try to do "emacs Grégory/test",
> emacs won't be able to open the file. However, it will be successful if I
> manually visit using C-x C-f.
>
> If I use any other editor (such as mcedit), it will open OK.
>
> Any explanation ?
>

Yes: your terminal emulator or shell swallows or hides information.

On Mac OS X in Apple's Terminal (TERM is xterm-color) I can see UTF-8
file names, for example äöüßÜÖÄ€. File name expansion/completion
does *not* work on them (although RGB äöüæÆÜÖÄ.txt gets
expanded to RGB a?^?o?^?u?^?æ?^?U?^?O?^?A?^?.txt). And of course
invoking GNU Emacs with this file name as argument does not work (nor
does it work with the built-in vi or nano). It *works*, though, when I do
that from the *shell* buffer in Unicode Emacs 23.0.0 or GNU Emacs 22.0.50
(although there is no file name completion, and the latter shows the ¨ as
empty boxes in the file name). If I, for example, paste a name with
UTF-8 contents from ls output to pass it to vi (it gives the best
complaints), I can see that the decomposed UTF-8 characters are
strangely interpreted. An ä seems to vanish and become a kind of
control character, and the ¨ component of A¨, i.e. Ä, is passed as <cc>
or such ...

Since in your case mcedit accepts the file name, mcedit and your
terminal seem to use the same character encoding, so for both é *is*
an é. GNU Emacs lives in its own world of almost countless character
encodings. One way to make Emacs work correctly is to set environment
variables like LC_ALL, LANG, or LC_CTYPE, which simply repeat
what your shell and your OS's standard utilities already know. The next
step is *not* to set current-language-environment! From LC_CTYPE etc.
Emacs learns what encodings to use for buffer contents, file names, and
process data. If it gets this wrong, you might consider using

	(prefer-coding-system           'iso-latin-9-unix)	; the one with €

or a few such statements, each with a different coding system. GNU Emacs
will then try to apply these encodings first. Since you're working with a
non-Unicode Emacs you might need to set

	(unify-8859-on-decoding-mode t)
	(unify-8859-on-encoding-mode t)

so that the 8-bit ISO Latin encodings are treated as nearly the same,
i.e. é would be the same character in every one of these encodings in
which it exists, so you could search for it in all buffers after telling
isearch only once to look for é.
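
As a rough illustration of the locale variables mentioned above: POSIX programs resolve the character-type locale by checking LC_ALL first, then LC_CTYPE, then LANG. The Python sketch below mimics that lookup. The helper names `effective_ctype` and `codeset` are made up for this example, and the code is not a description of how Emacs itself reads the environment.

```python
import os


def effective_ctype(env):
    """Return the locale string that governs character encoding.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values count as unset.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"  # POSIX default when nothing is set


def codeset(locale_name):
    """Extract the codeset part of e.g. 'fr_FR.UTF-8' (None if absent)."""
    if "." not in locale_name:
        return None
    return locale_name.split(".", 1)[1].split("@", 1)[0]


print(codeset(effective_ctype({"LANG": "fr_FR.UTF-8"})))
print(codeset(effective_ctype(dict(os.environ))))  # depends on your session
```

Running it in the same shell that starts Emacs shows which codeset a locale-aware program would be told to expect.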


One important thing is that *you* already messed up your .emacs file.  
Try to launch it also with --no-init-file and/or --no-site-file and  
also with -nw, i.e. running inside the terminal without X windows.

--
Greetings

   Pete

The most exciting phrase to hear in science, the one that heralds new  
discoveries, is not "Eureka!" (I found it!) but "That's funny..."
                                       Isaac Asimov


* Re: UTF-8 in path / filename
       [not found] ` <mailman.5606.1156507702.9609.help-gnu-emacs@gnu.org>
@ 2006-08-25 13:42   ` Grégory SCHMITT
  2006-08-25 18:35     ` Peter Dyballa
  0 siblings, 1 reply; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-25 13:42 UTC (permalink / raw)


Le Fri, 25 Aug 2006 14:08:11 +0200, Peter Dyballa a écrit :

> One important thing is that *you* already messed up your .emacs file. Try
> to launch it also with --no-init-file and/or --no-site-file and also with
> -nw, i.e. running inside the terminal without X windows.

OK. I did it. I moved my .emacs to another place, even though I never
really modified it. Still no success. For info, my locale is set as
LANG="fr_FR.UTF-8" (and that's all: no LC_CTYPE or other LC_* variables).
My terminal is an xterm (like yours); I tried from the console, with
bash only, and the result was still the same.

If I set emacs to run in unibyte mode (with --unibyte on the command
line), it does work, but the file content (which is UTF-8) is parsed as
8859-15.
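
The symptom described above (UTF-8 content parsed as 8859-15) can be reproduced outside Emacs. This is only a sketch of the general effect, using Python's codecs rather than anything Emacs-specific: the same octets are decoded once with the right codec and once with the wrong one.

```python
text = "Gr\u00e9gory"                       # "Grégory"

utf8_bytes = text.encode("utf-8")           # what a UTF-8 file actually contains
misread = utf8_bytes.decode("iso-8859-15")  # same bytes, wrong decoder

print(utf8_bytes)               # b'Gr\xc3\xa9gory'
print(misread)                  # GrÃ©gory
print(len(text), len(misread))  # 7 8
```

The single character é occupies two octets in UTF-8, and an 8-bit decoder shows each octet as a character of its own.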


-- 
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>


* Re: UTF-8 in path / filename
  2006-08-25 13:42   ` Grégory SCHMITT
@ 2006-08-25 18:35     ` Peter Dyballa
  2006-08-25 22:06       ` Grégory SCHMITT
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Dyballa @ 2006-08-25 18:35 UTC (permalink / raw)
  Cc: help-gnu-emacs


Am 25.08.2006 um 15:42 schrieb Grégory SCHMITT:

> If I set emacs to run in unibyte mode (with --unibyte on the command
> line), it does work, but the file content (which is UTF-8) is parsed as
> 8859-15.
>

This looks as if your system does not use UTF-8 ...

Can you create a file with accented characters? If not, can you put a
copy of the file in the Grégory directory into your home or some
other directory and invoke emacs, with and without --unibyte, with both
files? In the first case the accented name would appear in the mode
line of the buffer (and you would see what was passed or received as
argument); in the latter case GNU Emacs would put the directory's
name in the mode line, I hope, to distinguish the two files with the
same name. Again, you would see what was passed or received as
"Grégory" ...

Depending on whether the file names are UTF-8 or not, you can declare
this in .emacs with one of the following (not both; the later setting
would override the earlier one):

	(setq default-file-name-coding-system 'utf-8)
	(setq default-file-name-coding-system 'iso-8859-15)

There are a lot more *coding-systems you can set ...
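
Why the file name coding system matters can be sketched in Python, under the assumption of a file system that stores names as plain octets (as ext3 does). The same name encoded two ways gives two different octet strings, and only the one that matches what is on disk is found. The temporary directory is created just for the demonstration.

```python
import os
import tempfile

name = "Gr\u00e9gory"                   # "Grégory"
as_utf8 = name.encode("utf-8")          # b'Gr\xc3\xa9gory'
as_latin9 = name.encode("iso-8859-15")  # b'Gr\xe9gory'

with tempfile.TemporaryDirectory() as tmp:
    base = os.fsencode(tmp)
    # Create the directory with the UTF-8 octets, as a UTF-8 shell would.
    os.mkdir(os.path.join(base, as_utf8))

    found_utf8 = os.path.isdir(os.path.join(base, as_utf8))
    found_latin9 = os.path.isdir(os.path.join(base, as_latin9))

print(found_utf8, found_latin9)
```

A program that encodes the name with the wrong coding system asks the kernel for a path that simply does not exist.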

--
Greetings

   Pete
               <\
                 \__     O                       __O
                 | O\   _\\/\-%                _`\<,
                 '()-'-(_)--(_)               (_)/(_)


* Re: UTF-8 in path / filename
  2006-08-25 18:35     ` Peter Dyballa
@ 2006-08-25 22:06       ` Grégory SCHMITT
  2006-08-25 22:55         ` Peter Dyballa
       [not found]         ` <mailman.5656.1156546542.9609.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-25 22:06 UTC (permalink / raw)
  Cc: help-gnu-emacs

> ----- Original Message -----
> Date: Fri, 25 Aug 2006 20:35:08 +0200
> From: Peter Dyballa <Peter_Dyballa@Web.DE>
> To: Grégory SCHMITT <gregory.schmitt@free.fr>
> Cc: help-gnu-emacs@gnu.org

> Subject: Re: UTF-8 in path / filename

> 
> Am 25.08.2006 um 15:42 schrieb Grégory SCHMITT:
> 
> > If I set emacs to run in unibyte mode (with --unibyte on the command
> > line), it does work, but the file content (which is UTF-8) is parsed as
> > 8859-15.
> >
> 
> This looks as if your system does not use UTF-8 ...

I thought Fedora uses UTF-8 by default.

> Can you create a file with accented characters? If not, can you put a  
> copy of the file in the Grégory directory into your home or some  
> other directory and invoke emacs, with or with no unibytes, with both  
> files? In the first case the accented name would appear in the mode- 
> line of the buffer (and would see what was passed or received as  
> argument), in the latter case GNU Emacs would put the directory's  
> name in the mode-line, I hope, to distinguish the two files with the  
> same name. Again, you would see what was passed or received as  
> "Grégory" ...
> 
> If the file names are or are not UTF-8, you can declare this  
> in .emacs with:
> 
> 	(setq default-file-name-coding-system 'utf-8)
> 	(setq default-file-name-coding-system 'iso-8859-15)

OK. So I have two folders, "Greg" and "Grégory", in my home (ext3
filesystem, default options). I now have two files, "test" and "testé",
in each of them, plus in the current directory. Those files have the
same UTF-8 content, so I'm able to tell if they're parsed correctly or
not.

First case, with multibyte:
- both files in the "Greg" folder are visited correctly: file is
opened, content looks ok. However, the buffer name for "testé" appears
as "testÀ" (or something like that), which to my mind is proof that the
file name is actually UTF-8 and displayed as ISO.
- both files in the "Grégory" folder are not visited. Manually visiting
the files works fine however, and the buffer name is correct ("testé" is
spelled correctly). File content is ok as well.
- both files in the current directory are visited ok, content is ok,
buffer name NOT ok.

Second, with unibyte:
- "Greg" folder: files visited ok, content NOT ok, buffer name NOT ok.
- "Grégory" folder: files visited ok, content NOT ok, buffer name NOT
ok.
- both files in the current directory are visited ok, content NOT ok,
buffer name NOT ok.

Hope that helps. As for me, I'm stuck...
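
One way to test the guess that a garbled buffer name is really UTF-8 shown through an 8-bit Latin charset is to reverse the mistake. The sketch below assumes ISO-8859-1 as the display charset; the exact glyphs seen in Emacs 21 may differ, and `undo_latin1_mojibake` is a hypothetical helper written for this example.

```python
def undo_latin1_mojibake(shown):
    """Recover text whose UTF-8 bytes were displayed as ISO-8859-1.

    Returns the input unchanged when the reversal is not possible.
    """
    try:
        return shown.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return shown


garbled = "test\u00e9".encode("utf-8").decode("iso-8859-1")
print(garbled)                        # testÃ©
print(undo_latin1_mojibake(garbled))  # testé
```

If the reversal yields the expected name, the octets on disk were UTF-8 all along and only the display was wrong.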

-- 
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>


* Re: UTF-8 in path / filename
  2006-08-25 22:06       ` Grégory SCHMITT
@ 2006-08-25 22:55         ` Peter Dyballa
       [not found]         ` <mailman.5656.1156546542.9609.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Dyballa @ 2006-08-25 22:55 UTC (permalink / raw)
  Cc: help-gnu-emacs


Am 26.08.2006 um 00:06 schrieb Grégory SCHMITT:

> Hope that helps. As for me, I'm stuck...

I feel the same! No GNU Emacs release is meant to handle UTF-8 the way
other applications do. Unicode Emacs 23.0.0 behaves a bit better.

What you could try is to set default-buffer-file-coding-system to
utf-8. It could also be that some entry in file-coding-system-alist
prevents you from seeing UTF-8 contents, so check its value. I have
in my customisation section

	'(unibyte-display-via-language-environment t)

and avoid set-language-environment.

There won't be a perfect solution with GNU Emacs in the near future ...


Is your Emacs copy installed from an RPM package or did you configure  
and compile yourself? For me UTF-8 and Emacs are too important to get  
it from somewhere, so I compile myself.

--
Greetings

   Pete

»¿ʇı̣ əsnqɐ ʇ,uɐɔ noʎ ɟı̣
ɓuı̣ɥʇʎuɐ sı̣ pooɓ ʇɐɥʍ«


* Re: UTF-8 in path / filename
       [not found]         ` <mailman.5656.1156546542.9609.help-gnu-emacs@gnu.org>
@ 2006-08-25 23:06           ` Grégory SCHMITT
  2006-08-25 23:09           ` Miles Bader
       [not found]           ` <mailman.5657.1156547377.9609.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-25 23:06 UTC (permalink / raw)


Le Sat, 26 Aug 2006 00:55:31 +0200, Peter Dyballa a écrit :

> 
> Am 26.08.2006 um 00:06 schrieb Grégory SCHMITT:
> 
> There won't be a perfect solution with GNU Emacs in the near future ...
> 
> 
> Is your Emacs copy installed from an RPM package or did you configure and
> compile yourself? For me UTF-8 and Emacs are too important to get it from
> somewhere, so I compile myself.
> 

It's the standard RPM from Fedora. I will give a look at other versions of
emacs.


-- 
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>


* Re: UTF-8 in path / filename
       [not found]         ` <mailman.5656.1156546542.9609.help-gnu-emacs@gnu.org>
  2006-08-25 23:06           ` Grégory SCHMITT
@ 2006-08-25 23:09           ` Miles Bader
  2006-08-26  9:36             ` Peter Dyballa
       [not found]           ` <mailman.5657.1156547377.9609.help-gnu-emacs@gnu.org>
  2 siblings, 1 reply; 17+ messages in thread
From: Miles Bader @ 2006-08-25 23:09 UTC (permalink / raw)


Peter Dyballa <Peter_Dyballa@Web.DE> writes:
> There won't be a perfect solution with GNU Emacs in the near future ...

You constantly seem to be having problems with UTF-8, but it works
absolutely perfectly for me, filenames, dired, everything (using emacs 22).

[It works perfectly even if I do `emacs -Q' to avoid loading my init
file, though I normally use (set-language-environment 'japanese).]

AFAIK the main thing is that your LANG environment variable be set to
something mentioning utf-8 -- I use "ja_JP.UTF-8".

If that doesn't work, I dunno, maybe it's something screwy about the mac.

-Miles

-- 
.Numeric stability is probably not all that important when you're guessing.


* Re: UTF-8 in path / filename
       [not found]           ` <mailman.5657.1156547377.9609.help-gnu-emacs@gnu.org>
@ 2006-08-25 23:22             ` Grégory SCHMITT
  2006-08-25 23:25               ` Miles Bader
  0 siblings, 1 reply; 17+ messages in thread
From: Grégory SCHMITT @ 2006-08-25 23:22 UTC (permalink / raw)


Le Sat, 26 Aug 2006 08:09:25 +0900, Miles Bader a écrit :

> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>> There won't be a perfect solution with GNU Emacs in the near future ...
> 
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using emacs
> 22).

Emacs 22 is said to be a MAJOR improvement for UTF-8. Wish I could get my
hands on a package release soon...


-- 
Grégory SCHMITT <mailto:gregory.schmitt@free.fr>


* Re: UTF-8 in path / filename
  2006-08-25 23:22             ` Grégory SCHMITT
@ 2006-08-25 23:25               ` Miles Bader
  0 siblings, 0 replies; 17+ messages in thread
From: Miles Bader @ 2006-08-25 23:25 UTC (permalink / raw)


Grégory SCHMITT <gregory.schmitt@free.fr> writes:
> Emacs 22 is said to be a MAJOR improvement for Utf. Wish I could get my
> hands on a package release soon...

I'm sure there must be somebody out there maintaining RPMs for the
development version (in debian you can use the "emacs-snapshot" package;
there are also nicely packaged windows binaries out there).

-Miles
-- 
"Most attacks seem to take place at night, during a rainstorm, uphill,
 where four map sheets join."   -- Anon. British Officer in WW I


* Re: UTF-8 in path / filename
  2006-08-25 23:09           ` Miles Bader
@ 2006-08-26  9:36             ` Peter Dyballa
  2006-08-26 22:13               ` James Cloos
       [not found]               ` <mailman.5694.1156630455.9609.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Dyballa @ 2006-08-26  9:36 UTC (permalink / raw)
  Cc: help-gnu-emacs


Am 26.08.2006 um 01:09 schrieb Miles Bader:

> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>> There won't be a perfect solution with GNU Emacs in the near future ...
>
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using emacs 22).
>
> [It works perfectly even if I do `emacs -Q' to avoid loading my init
> file, though I normally use (set-language-environment 'japanese).]
>
> AFAIK the main thing is that your LANG environment variable be set to
> something mentioning utf-8 -- I use "ja_JP.UTF-8".
>

	pete 39 /\ .
	/Users/pete
	pete 40 /\ env | egrep -i 'LC|LANG'
	LANG=de_DE.UTF-8
	LC_CTYPE=de_DE.UTF-8
	pete 41 /\  /usr/local/bin/emacs-22.0.50 -Q &

Files with UTF-8 characters in their names are shown in dired (it has -u:
in the mode line, i.e. it uses UTF-8) à la <vowel><empty box>. Some UTF-8
characters like ß or € show up as themselves. They appear in the same
manner in the buffer's mode line, once visited, and also in the list
of buffers (C-x b); they are completely unreadable in the Buffers menu
of the menu bar, and unreadable in yet another fashion in the
"Buffer Menu" pop-up. The font used for the vowels, the empty boxes,
and the other characters is taken from the Java SDK and quite rich
(1425 mapped characters for mostly European and some Near Eastern
scripts):

      -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x61)
      -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x308)
      -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#xDF)
      -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x20AC)

Somehow this looks like a mixture of ISO 8859 characters (#x61, #xDF)
and Unicode (#x20AC) and something else (#x308). Or are some
representations just abbreviations that leave out the leading zeros?
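
On the leading-zeros question: in a font whose registry is ISO10646-1 the glyph codes are Unicode code points, so all four numbers can be read as the same kind of value, printed without zero padding. A small Python check, which only does arithmetic on code points and queries nothing from Emacs or the font:

```python
samples = ["a", "\u0308", "\u00df", "\u20ac"]  # a, combining diaeresis, ß, €

for ch in samples:
    cp = ord(ch)
    # The same number written three ways: bare hex, zero-padded, U+ notation.
    print(f"#x{cp:X}  0x{cp:04X}  U+{cp:04X}")

# For characters below U+0100 the ISO 8859-1 byte equals the code point,
# which is why #x61 and #xDF look like ISO 8859 values.
latin1_sharp_s = "\u00df".encode("iso-8859-1")
print(latin1_sharp_s == bytes([0xDF]))
```

So #x61, #x308, #xDF and #x20AC correspond to U+0061, U+0308, U+00DF and U+20AC.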

The other information from C-u C-x = on the examples is:

  character: a (97, #o141, #x61, U+0061)
    charset: ascii (ASCII (ISO646 IRV))
 code point: #x61
     syntax: w 	which means: word
   category: a:ASCII l:Latin
buffer code: #x61
  file code: #x61 (encoded by coding system mule-utf-8)

  character:  (332488, #o1211310, #x512c8, U+0308)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x25 #x48
     syntax: w 	which means: word
   category: ^:Combining diacritic or mark
buffer code: #x9C #xF4 #xA5 #xC8
  file code: #xCC #x88 (encoded by coding system mule-utf-8)

  character: ß (2271, #o4337, #x8df, U+00DF)
    charset: latin-iso8859-1 (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
 code point: #x5F
     syntax: w 	which means: word
   category: l:Latin
buffer code: #x81 #xDF
  file code: #xC3 #x9F (encoded by coding system mule-utf-8)

  character: € (342604, #o1235114, #x53a4c, U+20AC)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x74 #x4C
     syntax: w 	which means: word
buffer code: #x9C #xF4 #xF4 #xCC
  file code: #xE2 #x82 #xAC (encoded by coding system mule-utf-8)
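
The "file code" lines can be cross-checked independently of Emacs, since they are just the UTF-8 encoding of each code point. A minimal Python sketch (the helper `file_code` is hypothetical and only imitates the output format):

```python
import unicodedata


def file_code(ch):
    """Format the UTF-8 octets of one character like the 'file code' lines."""
    return " ".join(f"#x{b:02X}" for b in ch.encode("utf-8"))


for ch in ["a", "\u0308", "\u00df", "\u20ac"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {file_code(ch)}")
```

The octets agree with the ones reported above; the "buffer code" values are internal to that Emacs version and are not reproduced here.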

An excerpt from the fontset's description (I am missing ISO 8859-16!):

Fontset: -*-*-medium-r-*-*-10-*-*-*-m-*-fontset-startup
CHARSET or CHAR RANGE	FONT NAME
---------------------	---------
ascii			-b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1
      [-Adobe-Courier-Medium-R-Normal--10-100-75-75-M-60-ISO10646-1]
      [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
      [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-1		-b&h-lucidatypewriter-*-iso10646-1
      [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
      [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-2		-*-iso8859-2
latin-iso8859-3		-*-iso8859-3
latin-iso8859-4		-*-iso8859-4
thai-tis620		-*-*-*-tis620-*
greek-iso8859-7		-*-iso8859-7
arabic-iso8859-6	-*-iso8859-6
hebrew-iso8859-8	-*-iso8859-8
katakana-jisx0201	-*-jisx0201-*
latin-jisx0201		-*-jisx0201-*
cyrillic-iso8859-5	-*-iso8859-5
latin-iso8859-9		-*-iso8859-9
latin-iso8859-15	-*-iso8859-15
latin-iso8859-14	-*-iso8859-14
...
mule-unicode-2500-33ff	-b&h-lucidatypewriter-*-iso10646-1
mule-unicode-e000-ffff	-b&h-lucidatypewriter-*-iso10646-1
mule-unicode-0100-24ff	-b&h-lucidatypewriter-*-iso10646-1
      [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
      [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
...

IMO the display of UTF-8 characters is not sufficient.


> If that doesn't work, I dunno, maybe it's something screwy about the mac.
>

There is something special, possibly screwy, in the way Mac OS X (or
rather HFS+, its file system) stores UTF-8 characters in file names:
they get decomposed, i.e. an ä becomes a¨, an à becomes a`, etc. (This
applies only to file names; a file's contents are not decomposed. What
would such a JPEG picture look like?) So a character that would
otherwise take two or three octets is stored on disk as a pair of one
octet and (mostly?) two octets. GNU Emacs should be able to detect that:
if a character is from the category (see above) "Combining diacritic or
mark" it can't stand alone by nature, but must be combined with the
character on the left in a left-to-right writing system or with the
character on the right in a right-to-left writing system (I have no
idea of the rules in a top-to-bottom writing system like Mongolian,
or whether these have combining characters). And it should be able
to handle the character categories correctly.
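
The decomposition described above can be reproduced with Python's `unicodedata` module, as a sketch of the Unicode side only (it does not touch HFS+). Note that in Unicode's stored, logical order a combining mark always follows its base character, whatever the writing direction; left or right is a matter of display.

```python
import unicodedata

nfc = "\u00e4"                           # 'ä' as one code point
nfd = unicodedata.normalize("NFD", nfc)  # 'a' followed by U+0308

print([f"U+{ord(c):04X}" for c in nfd])                    # ['U+0061', 'U+0308']
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 2 3

# A non-zero canonical combining class marks a character that attaches
# to the preceding base character and cannot stand alone.
print([unicodedata.combining(c) for c in nfd])   # [0, 230]
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

The two UTF-8 octets of the precomposed ä become one octet for the base letter plus two for the combining mark.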

--
Greetings

   Pete

What's the difference between OS X and Vista?

Microsoft employees are excited about OS X…


* Re: UTF-8 in path / filename
  2006-08-26  9:36             ` Peter Dyballa
@ 2006-08-26 22:13               ` James Cloos
  2006-08-27 13:12                 ` Peter Dyballa
       [not found]               ` <mailman.5694.1156630455.9609.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 17+ messages in thread
From: James Cloos @ 2006-08-26 22:13 UTC (permalink / raw)
  Cc: help-gnu-emacs, Miles Bader

>>>>> "Peter" == Peter Dyballa <Peter_Dyballa@Web.DE> writes:

Peter> Files with UTF-8 characters in them are shown in dired (has -u: in
Peter> mode-line, i.e. uses UTF-8) à la <vowel><empty box>. Some UTF-8
Peter> characters like ß or € show up as themselves.

Doesn't Apple by default use NFD (Normalization Form D, canonical
decomposition) for filenames?  That would explain the <vowel><box>
sequences.

I suspect most others end up with NFC filenames.  And composition
seems much better in the emacs-unicode-2 branch than in HEAD.
(But still not perfect.  I sometimes get bad metrics on composed
glyphs; and sometimes they display as intended....)

Can you get at the actual octet-sequence of the filenames?

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 0xED7DAEA6


* Re: UTF-8 in path / filename
       [not found]               ` <mailman.5694.1156630455.9609.help-gnu-emacs@gnu.org>
@ 2006-08-27  8:46                 ` Harald Hanche-Olsen
  0 siblings, 0 replies; 17+ messages in thread
From: Harald Hanche-Olsen @ 2006-08-27  8:46 UTC (permalink / raw)


+ James Cloos <cloos@jhcloos.com>:

| Doesn't apple by default use NFD (Normalization Form Decomposed) for
| filenames?

Seems you're right.  See below.

| Can you get at the actual octet-sequence of the filenames?

I just now used TextEdit to create a text file with the filename

xxx-é-ï-ē-ĭ-ǫḥ.txt

(the xxx- prefix only so I could access it using wildcards in my shell)

; echo xxx-*.txt | od -t a
0000000    x   x   x   -   e  cc  81   -   i  cc  88   -   e  cc  84   -
0000020    i  cc  86   -   o  cc  a8   h  cc  a3   .   t   x   t  nl    
0000037
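
The octets in this dump can be decoded to confirm the NFD reading. The Python sketch below copies the hex values by hand from the od output (mapping od's character names back to ASCII codes) and is not a general od parser.

```python
import unicodedata

# Octets from the dump above, without the trailing newline.
raw = bytes.fromhex(
    "78 78 78 2d 65 cc 81 2d 69 cc 88 2d 65 cc 84 2d"
    "69 cc 86 2d 6f cc a8 68 cc a3 2e 74 78 74"
)

name = raw.decode("utf-8")
composed = unicodedata.normalize("NFC", name)

print(len(raw), len(name), len(composed))         # 30 24 18
print(unicodedata.normalize("NFD", name) == name)  # True: already decomposed
print(composed)
```

Every accented letter is stored as a base letter followed by one combining mark, and NFC folds each pair back into a single precomposed character.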

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell


* Re: UTF-8 in path / filename
  2006-08-26 22:13               ` James Cloos
@ 2006-08-27 13:12                 ` Peter Dyballa
  2006-08-28 15:11                   ` James Cloos
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Dyballa @ 2006-08-27 13:12 UTC (permalink / raw)
  Cc: help-gnu-emacs, Miles Bader


Am 27.08.2006 um 00:13 schrieb James Cloos:

> Peter> Files with UTF-8 characters in them are shown in dired (has -u: in
> Peter> mode-line, i.e. uses UTF-8) à la <vowel><empty box>. Some UTF-8
> Peter> characters like ß or € show up as themselves.
>
> Doesn't apple by default use NFD (Normalization Form Decomposed) for
> filenames?  That would explain the <vowel><box> sequences.

Yes, that's the correct term for the way file names are recorded in  
HFS+.

The font file, LucidaTypewriterRegular.ttf, has no combining  
diacritical marks defined (only some modifiers), so these empty boxes  
are displayed instead.

>
> Can you get at the actual octet-sequence of the filenames?

Do you know a tool that can do that? I can only think of a C
programme that reads the directory entry and then outputs the octets.
Doing the same as Harald did, I get different output in Terminal
(because UTF-8 characters are substituted with question marks), for
example:

	pete 140 /\ l -1 | grep .txt | grep ' ' | grep -v Mac
	RGB äöüæÆÜÖÄ.txt
	pete 141 /\ l -1 | grep .txt | grep ' ' | grep -v Mac | od -t a
	    R   G   B  sp   a   ?  88   o   ?  88   u   ?  88   ?   ?   ?
	   86   U   ?  88   O   ?  88   A   ?  88   .   t   x   t  nl

In the Emacsen's shells I get:

	    R   G   B  sp   a   \314  88   o   \314  88   u   \314  88   \303   \246   \303
	   86   U   \314  88   O   \314  88   A   \314  88   .   t   x   t  nl

The file name áÛïǓà.txt is interpreted as:

	    a   \314  81   U   \314  82   i   \314  88   U   \314  8c   a   \314  80   .
	    t   x   t  nl
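
As for a tool that shows the raw octets of file names without a terminal in between: one option, offered as a sketch rather than a recommendation, is a few lines of Python. Calling `os.listdir` with a bytes path returns the names as undecoded bytes. The helper `name_octets` and the temporary directory are made up for the demonstration. In the dumps above, the octal escape \314 is the same octet as hex cc.

```python
import os
import tempfile


def name_octets(directory):
    """Map each file name in `directory` to its raw octets as hex."""
    # Passing a bytes path makes os.listdir return undecoded bytes names.
    return {
        name: " ".join(f"{b:02x}" for b in name)
        for name in os.listdir(os.fsencode(directory))
    }


with tempfile.TemporaryDirectory() as tmp:
    # 'a' + U+0308 (octal \314 \210 == hex cc 88), i.e. a decomposed 'ä'.
    nfd_name = b"a\xcc\x88.txt"
    open(os.path.join(os.fsencode(tmp), nfd_name), "w").close()
    listing = name_octets(tmp)

for octets in listing.values():
    print(octets)
```

Pointing `name_octets` at a real directory lists what is actually stored, regardless of how the terminal would render it.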

--
Greetings

   Pete

"Isn't vi that text editor with two modes... one that beeps and one
that corrupts your file?" -- Dan Jacobson, on comp.os.linux.advocacy


* Re: UTF-8 in path / filename
  2006-08-27 13:12                 ` Peter Dyballa
@ 2006-08-28 15:11                   ` James Cloos
  2006-08-28 15:55                     ` Peter Dyballa
  0 siblings, 1 reply; 17+ messages in thread
From: James Cloos @ 2006-08-28 15:11 UTC (permalink / raw)
  Cc: help-gnu-emacs, Miles Bader

JimC> Doesn't apple by default use NFD (Normalization Form Decomposed)
JimC> for filenames?  That would explain the <vowel><box> sequences.

Peter> Yes, that's the correct term for the way file names are
Peter> recorded in HFS+.

So then the problem is narrowed to support for composition.

I just gave it a test, running the unicode-2 branch on a linux box,
using the en_US.UTF-8 locale.

I copied the filename you quoted (äöüæÆÜÖÄ.txt), gave it a prefix to
ease globbing (resulting in /tmp/xxx-äöüæÆÜÖÄ.txt), and ran find-file
on /tmp.  It worked correctly.  (Well, almost; the glyphs composed by
emacs have twice the height of pre-composed glyphs.  There was a time
when emacs didn't do that, but it is doing it again.  Including in
this buffer.  But that looks to be specific to --enable-font-backend
and DejaVu Sans Mono.  With other fonts I do not get visible accents,
even though C-u C-x = claims it is composing.  And without --e-f-b I
get composed glyphs which have correct vertical metrics.)

I also tested this:

  :; echo /tmp/xxx-a*

and got the filename, showing that bash treats the code points as
separate characters when globbing.  (Which also means I didn't
actually need the xxx- prefix, since a* will therefore match the
original filename....)
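
The globbing observation can be imitated with Python's `fnmatch` module, which stands in for bash here as an assumption (it is a different implementation, but it also matches code point by code point):

```python
import fnmatch
import unicodedata

nfc_name = "\u00e4\u00f6\u00fc.txt"                # precomposed 'äöü.txt'
nfd_name = unicodedata.normalize("NFD", nfc_name)  # 'a' + U+0308, ...

print(fnmatch.fnmatchcase(nfd_name, "a*"))  # True
print(fnmatch.fnmatchcase(nfc_name, "a*"))  # False
```

A decomposed name starts with a plain ASCII "a", so the pattern matches; the precomposed name starts with U+00E4 and does not.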

So.  Does C-u C-x = claim to be composing for you?

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 0xED7DAEA6


* Re: UTF-8 in path / filename
  2006-08-28 15:11                   ` James Cloos
@ 2006-08-28 15:55                     ` Peter Dyballa
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Dyballa @ 2006-08-28 15:55 UTC (permalink / raw)
  Cc: help-gnu-emacs, Miles Bader


Am 28.08.2006 um 17:11 schrieb James Cloos:

> So.  Does C-u C-x = claim to be composing for you?

Yes, in GNU Emacs 23:

         character: U (85, #o125, #x55)
preferred charset: ascii (ASCII (ISO646 IRV))
        code point: 0x55
            syntax: w 	which means: word
          category: a:ASCII l:Latin r:Japanese roman
       buffer code: #x55
         file code: not encodable by coding system utf-8-unix
           display: composed to form "Ü" (see below)
      Unicode data:
              Name: LATIN CAPITAL LETTER U
          Category: Letter, Uppercase
   Combining class: Lu
     Bidi category: Lu
         Lowercase: u

Composed with the following character(s) "¨" by the rule:
	(?U (tc . bc) ?¨)
The component character(s) are displayed by these fonts (glyph codes):
U: -B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO8859-1 (#x55)
¨: -MUTT-ClearlyU-Medium-R-Normal--17-120-100-100-P-123-ISO10646-1 (#x308)

(Here you can see the reason for the large vertical composed  
characters: a much too big font.)

In GNU Emacs 22.0.50 they are not composed, they are <vowel><accent>.


Instead of composing a character I would first try to find the
precomposed form in the font(set) used. It would surely look much better.
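
The "precomposed first" idea can be sketched at the Unicode level in Python. This only checks whether Unicode defines a single precomposed character; whether the font in use actually has a glyph for it is a separate question that the sketch does not address. `prefer_precomposed` is a hypothetical helper, not an Emacs function.

```python
import unicodedata


def prefer_precomposed(base, mark):
    """Return a single precomposed character when Unicode defines one,
    otherwise the original base + combining mark sequence."""
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else base + mark


print(len(prefer_precomposed("U", "\u0308")))  # 1  (U+00DC, 'Ü')
print(len(prefer_precomposed("q", "\u0308")))  # 2  (no precomposed form)
```

A display engine following this approach would fall back to dynamic composition only for the second kind of sequence.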

--
Greetings

   Pete

"We have to expect it, otherwise we would be surprised."
