From: Peter Dyballa
Newsgroups: gmane.emacs.help
Subject: Re: UTF-8 in path / filename
Date: Sat, 26 Aug 2006 11:36:34 +0200
Message-ID: <0C15C504-B711-403E-B8D1-F03234C453E3@Web.DE>
References: <7D07BEAB-2279-48C5-BB9A-3FF3A15D0FED@Web.DE> <20060826000627.b8b44e95.gregory.schmitt@free.fr> <87odu8ct0a.fsf@catnip.gol.com>
In-Reply-To: <87odu8ct0a.fsf@catnip.gol.com>
Original-To: Miles Bader
Cc: help-gnu-emacs@gnu.org
Xref: news.gmane.org gmane.emacs.help:36946

On 26.08.2006 at 01:09, Miles Bader wrote:

> Peter Dyballa writes:
>> There won't be a perfect solution with GNU Emacs in the near
>> future ...
>
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using
> emacs 22).
>
> [It works perfectly even if I do `emacs -Q' to avoid loading my init
> file, though I normally use (set-language-environment 'japanese).]
>
> AFAIK the main thing is that your LANG environment variable be set to
> something mentioning utf-8 -- I use "ja_JP.UTF-8".
>

	pete 39 /\ .
	/Users/pete
	pete 40 /\ env | egrep -i 'LC|LANG'
	LANG=de_DE.UTF-8
	LC_CTYPE=de_DE.UTF-8
	pete 41 /\ /usr/local/bin/emacs-22.0.50 -Q &

Files with UTF-8 characters in their names are shown in dired (which has
-u: in the mode-line, i.e. it uses UTF-8) à la . Some UTF-8
characters like ß or € show up as themselves.
In the same manner they appear in the buffer's mode-line, once visited,
and also in the list of buffers buffer (C-x b). They are completely
unreadable in the Buffers menu of the menu bar, and unreadable in yet
another fashion in the "Buffer Menu" pop-up. The font used for the
vowels, the empty boxes, and the other characters is taken from the Java
SDK and is quite rich (1425 mapped characters for mostly European and
some Near Eastern scripts):

	-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x61)
	-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x308)
	-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#xDF)
	-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x20AC)

Somehow this looks like a mixture of ISO 8859 characters (#x61, #xDF),
Unicode (#x20AC) and something else (#x308). Or are some of these
representations just abbreviations that leave out the leading zeros?

The other information from C-u C-x = on the examples is:

	  character: a (97, #o141, #x61, U+0061)
	    charset: ascii (ASCII (ISO646 IRV))
	 code point: #x61
	     syntax: w 	which means: word
	   category: a:ASCII l:Latin
	buffer code: #x61
	  file code: #x61 (encoded by coding system mule-utf-8)

	  character:  (332488, #o1211310, #x512c8, U+0308)
	    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
	 code point: #x25 #x48
	     syntax: w 	which means: word
	   category: ^:Combining diacritic or mark
	buffer code: #x9C #xF4 #xA5 #xC8
	  file code: #xCC #x88 (encoded by coding system mule-utf-8)

	  character: ß (2271, #o4337, #x8df, U+00DF)
	    charset: latin-iso8859-1 (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
	 code point: #x5F
	     syntax: w 	which means: word
	   category: l:Latin
	buffer code: #x81 #xDF
	  file code: #xC3 #x9F (encoded by coding system mule-utf-8)

	  character: € (342604, #o1235114, #x53a4c, U+20AC)
	    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
	 code point: #x74 #x4C
	     syntax: w 	which means: word
	buffer code: #x9C #xF4 #xF4 #xCC
	  file code: #xE2 #x82 #xAC (encoded by coding system mule-utf-8)

An excerpt from the fontset's description (I am missing ISO 8859-16!):

	Fontset: -*-*-medium-r-*-*-10-*-*-*-m-*-fontset-startup
	CHARSET or CHAR RANGE    FONT NAME
	---------------------    ---------
	ascii                    -b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1
	    [-Adobe-Courier-Medium-R-Normal--10-100-75-75-M-60-ISO10646-1]
	    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
	    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
	latin-iso8859-1          -b&h-lucidatypewriter-*-iso10646-1
	    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
	    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
	latin-iso8859-2          -*-iso8859-2
	latin-iso8859-3          -*-iso8859-3
	latin-iso8859-4          -*-iso8859-4
	thai-tis620              -*-*-*-tis620-*
	greek-iso8859-7          -*-iso8859-7
	arabic-iso8859-6         -*-iso8859-6
	hebrew-iso8859-8         -*-iso8859-8
	katakana-jisx0201        -*-jisx0201-*
	latin-jisx0201           -*-jisx0201-*
	cyrillic-iso8859-5       -*-iso8859-5
	latin-iso8859-9          -*-iso8859-9
	latin-iso8859-15         -*-iso8859-15
	latin-iso8859-14         -*-iso8859-14
	...
	mule-unicode-2500-33ff   -b&h-lucidatypewriter-*-iso10646-1
	mule-unicode-e000-ffff   -b&h-lucidatypewriter-*-iso10646-1
	mule-unicode-0100-24ff   -b&h-lucidatypewriter-*-iso10646-1
	    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
	    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
	...

IMO the display of UTF-8 characters is not sufficient.

> If that doesn't work, I dunno, maybe it's something screwy about
> the mac.
>

There is something special, possibly screwy, in the way Mac OS X (or
rather HFS+, its file system) stores UTF-8 characters in file names:
they get decomposed, e.g. an ä becomes a¨, an à becomes a`, etc. (Only
file names are affected; a file's contents are not decomposed. What
would such a JPEG picture look like?) So the two or three octets of the
precomposed character are expanded on disk to a pair of one octet and
(mostly?) two octets. GNU Emacs should be able to detect that: if a
character is from the category (see above) "Combining diacritic or
mark", it can't stand alone by nature, but must be combined with the
character on its left in a left-to-right writing system, or with the
character on its right in a right-to-left writing system. (I have no
idea of the rules in a top-to-bottom writing system like Mongolian, or
whether such systems have combining characters at all.) And it should
be able to handle the character categories correctly.

--
Greetings

  Pete

What's the difference between OS X and Vista?
Microsoft employees are excited about OS X...
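
The decomposition described above can be reproduced outside Emacs. What follows is only a minimal Python sketch. It assumes that standard Unicode NFD normalization is a close enough stand-in for what HFS+ does to file names (Apple's variant is said to differ in some details). The function names are made up for illustration and are not part of any Emacs or Mac OS X API.

```python
import unicodedata


def decompose(name: str) -> str:
    """Return the canonically decomposed (NFD) form of a string.

    Assumption: plain NFD approximates the HFS+ file-name behaviour.
    """
    return unicodedata.normalize("NFD", name)


def compose(name: str) -> str:
    """Return the precomposed (NFC) form, undoing decompose() where possible."""
    return unicodedata.normalize("NFC", name)


def utf8_hex(text: str) -> str:
    """Format the UTF-8 octets of text the way `C-u C-x =' shows a file code."""
    return " ".join(f"#x{octet:02X}" for octet in text.encode("utf-8"))


def has_combining_marks(text: str) -> bool:
    """True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)


if __name__ == "__main__":
    precomposed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS
    decomposed = decompose(precomposed)

    # One character (two octets) becomes base letter + combining mark
    # (one octet + two octets).
    print(utf8_hex(precomposed))  # #xC3 #xA4
    print(utf8_hex(decomposed))   # #x61 #xCC #x88
    print(has_combining_marks(decomposed))
    print(compose(decomposed) == precomposed)
```

The octets #xCC #x88 are the same file code that `C-u C-x =' reports for U+0308, so a program that sees a character with a non-zero combining class directly after a base letter could, under the assumption stated above, recompose the pair before displaying the file name.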