unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#4157: 23.1.50; faulty character characterisation for ä
@ 2009-08-16  2:19 Peter Dyballa
  2009-08-18  1:09 ` Kenichi Handa
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Peter Dyballa @ 2009-08-16  2:19 UTC (permalink / raw)
  To: emacs-pretest-bug

[-- Attachment #1: Type: text/plain, Size: 911 bytes --]

Hello!

When I launch GNU Emacs in an ISO Latin environment (env
LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15
/usr/local/bin/emacs-23.1.50 -Q &) and display in dired a directory
with entries from some month of March, the "Mär" abbreviation for the
German month name "März" is displayed as M\344r. C-u C-x = on this \344
reveals:

	        character: \344 (4194276, #o17777744, #x3fffe4)
	preferred charset: eight-bit (Raw bytes 128-255)
	       code point: 0xE4
	           syntax: w 	which means: word
	      buffer code: #xE4
	        file code: not encodable by coding system iso-latin-9-unix
	          display: no font available

The dired buffer has a 0 as encoding indicator. In the ISO Latin 1 or 15
encodings, LATIN SMALL LETTER A WITH DIAERESIS is \344 = 228 = 0xE4 =
U+00E4, a valid character and not some raw "eight-bit" entity. This
might be what prevents proper display:


[-- Attachment #2: pastedGraphic.tiff --]
[-- Type: image/tiff, Size: 9998 bytes --]

[-- Attachment #3: Type: text/plain, Size: 60 bytes --]



In *shell* buffer both Apple's ls and GNU's gls display:


[-- Attachment #4: pastedGraphic.tiff --]
[-- Type: image/tiff, Size: 9152 bytes --]

[-- Attachment #5: Type: text/plain, Size: 2511 bytes --]



Here the ä is described as:

	        character: ä (228, #o344, #xe4)
	preferred charset: iso-8859-15 (ISO/IEC 8859/15)
	       code point: 0xE4
	           syntax: w 	which means: word
	         category: .:Base, j:Japanese, l:Latin
	      buffer code: #xC3 #xA4
	        file code: #xE4 (encoded by coding system iso-latin-9-unix)
	          display: by this font (glyph code)
	    x:-b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1 (#xE4)

The buffer's encoding is "0" as well, i.e., ISO Latin 1 or 15.

BTW, the display is correct in a UTF-8 environment.


In GNU Emacs 23.1.50.1 (powerpc-apple-darwin8.11.0, X toolkit, Xaw3d  
scroll bars)
  of 2009-07-30 on Latsche.local
Windowing system distributor `The XFree86 Project, Inc', version  
11.0.40400000
configured using `configure  '--without-sound' '--without-pop'
'--with-dbus' '--with-libotf' '--with-x-toolkit=athena'
'--x-includes=/usr/X11R6/include' '--x-libraries=/usr/X11R6/lib'
'--enable-locallisppath=/Library/Application Support/Emacs/calendar23:/Library/Application Support/Emacs'
'CPPFLAGS=-no-cpp-precomp -I/sw/include
-I/sw/lib/pango-ft219/include/pango-1.0 -idirafter /usr/X11R6/include'
'CFLAGS=-ggdb3 -gfull -mtraceback=full -Wno-pointer-sign -H -pipe
-fPIC -mcpu=7450 -mtune=7450 -fast -mpim-altivec -ftree-vectorize
-foptimize-register-move -freorder-blocks -fthread-jumps -fpeephole
-fno-crossjumping' 'LDFLAGS=-dead_strip -multiply_defined suppress
-L/sw/lib''

Important settings:
   value of $LC_ALL: nil
   value of $LC_COLLATE: nil
   value of $LC_CTYPE: de_DE.ISO8859-15
   value of $LC_MESSAGES: nil
   value of $LC_MONETARY: nil
   value of $LC_NUMERIC: nil
   value of $LC_TIME: nil
   value of $LANG: de_DE.ISO8859-15
   value of $XMODIFIERS: nil
   locale-coding-system: iso-latin-9-unix
   default-enable-multibyte-characters: t

Major mode: Dired by name

Minor modes in effect:
   shell-dirtrack-mode: t
   show-paren-mode: t
   display-time-mode: t
   tooltip-mode: t
   tool-bar-mode: t
   mouse-wheel-mode: t
   file-name-shadow-mode: t
   global-font-lock-mode: t
   font-lock-mode: t
   blink-cursor-mode: t
   global-auto-composition-mode: t
   auto-composition-mode: t
   auto-encryption-mode: t
   auto-compression-mode: t
   column-number-mode: t
   line-number-mode: t
   transient-mark-mode: t

--
Greetings

   Pete

If you're not confused, you're not paying attention.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-16  2:19 bug#4157: 23.1.50; faulty character characterisation for ä Peter Dyballa
@ 2009-08-18  1:09 ` Kenichi Handa
  2009-08-18 13:40   ` Peter Dyballa
  2009-08-22  4:09 ` Stefan Monnier
  2019-10-09 14:29 ` Stefan Kangas
  2 siblings, 1 reply; 47+ messages in thread
From: Kenichi Handa @ 2009-08-18  1:09 UTC (permalink / raw)
  To: Peter Dyballa, 4157

In article <57B19222-57FF-40C8-8C94-8D19E1281D14@Freenet.DE>, Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:

> When I launch GNU Emacs in an ISO Latin environment (env  
> LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 /usr/local/bin/ 
> emacs-23.1.50 -Q &) and display in dired a directory with entries  
> from some month of March the "Mär" abbreviation for the German month
> name "März" is displayed as M\344r. C-u C-x = on this \344 reveals:

> 	        character: \344 (4194276, #o17777744, #x3fffe4)
> 	preferred charset: eight-bit (Raw bytes 128-255)
> 	       code point: 0xE4
> 	           syntax: w 	which means: word
> 	      buffer code: #xE4
> 	        file code: not encodable by coding system iso-latin-9-unix
> 	          display: no font available

> The dired buffer has a 0 as encoding indicator. In ISO Latin 1 or 15
> encodings LATIN SMALL LETTER A WITH DIAERESIS is \344 = 228 = 0xE4 =
> U+00E4 a valid character and not some raw "eight-bit" entity. Could be
> this prevents proper display:

Please show the value of default-file-name-coding-system and
file-name-coding-system.

---
Kenichi Handa
handa@m17n.org





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-18  1:09 ` Kenichi Handa
@ 2009-08-18 13:40   ` Peter Dyballa
  2009-08-19  0:23     ` bug#4157: " Kenichi Handa
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Dyballa @ 2009-08-18 13:40 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


Am 18.08.2009 um 03:09 schrieb Kenichi Handa:

>
> Please show the value of default-file-name-coding-system and
> file-name-coding-system.
>

I (seem to) see: it's utf-8 for the first and nil for the second  
variable (the same as globally). So the string M\344r, coming from  
some ls which follows LC_CTYPE or LANG, is interpreted as being UTF-8  
which it of course isn't...
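
A minimal sketch of this misinterpretation, written in Python rather
than Emacs Lisp and purely for illustration (the variable names are
made up): the three bytes that ls emits for "Mär" under ISO 8859-15
decode cleanly with that coding system but are not valid UTF-8. The
surrogateescape handler is only a loose analogy to Emacs keeping the
byte as a raw eight-bit character.

```python
# Bytes as emitted by ls in a de_DE.ISO8859-15 locale: "M", 0xE4, "r".
raw = b"M\xe4r"

# Decoded with the locale's coding system, 0xE4 maps to U+00E4 (ä).
as_latin9 = raw.decode("iso8859-15")

# Decoded as UTF-8, the lone 0xE4 starts a multi-byte sequence that is
# never completed, so strict decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption for illustration: "surrogateescape" stands in for the way
# Emacs preserves the undecodable byte instead of dropping it.
escaped = raw.decode("utf-8", errors="surrogateescape")

print(ascii(as_latin9), utf8_ok, ascii(escaped))
```

The escaped form can be encoded back to the original bytes, which is
the property that matters here: nothing is lost, the byte is merely not
recognised as a character.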

--
Greetings

   Pete

Every instructor assumes that you have nothing else to do except  
study for that instructor's course.
				– Fourth Law of Applied Terror






^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: Re: bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-18 13:40   ` Peter Dyballa
@ 2009-08-19  0:23     ` Kenichi Handa
  2009-08-19 22:47       ` Peter Dyballa
  2009-08-24 11:30       ` Peter Dyballa
  0 siblings, 2 replies; 47+ messages in thread
From: Kenichi Handa @ 2009-08-19  0:23 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: 4157

In article <14A765B4-9EAF-46AC-BEBC-6B0A664BA03A@Freenet.DE>, Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:

> > Please show the value of default-file-name-coding-system and
> > file-name-coding-system.
> >

> I (seem to) see: it's utf-8 for the first and nil for the second  
> variable (the same as globally). So the string M\344r, coming from  
> some ls which follows LC_CTYPE or LANG, is interpreted as being UTF-8  
> which it of course isn't...

Ah, I found this code in mule-cmds.el.

  (if (eq system-type 'darwin)
      ;; The file-name coding system on Darwin systems is always utf-8.
      (setq default-file-name-coding-system 'utf-8)

I don't remember why that code exists.  If the comment is
wrong (i.e. there's no need of treating darwin specially
here), the attached patch should solve the problem.  Please
try it.

---
Kenichi Handa
handa@m17n.org

--- mule-cmds.el.~1.364.~	2009-08-13 20:59:18.000000000 +0900
+++ mule-cmds.el	2009-08-19 09:21:33.000000000 +0900
@@ -355,13 +355,10 @@
 	(or (local-variable-p 'buffer-file-coding-system buffer)
 	    (ucs-set-table-for-input buffer))))
 
-  (if (eq system-type 'darwin)
-      ;; The file-name coding system on Darwin systems is always utf-8.
-      (setq default-file-name-coding-system 'utf-8)
-    (if (and default-enable-multibyte-characters
-	     (or (not coding-system)
-		 (coding-system-get coding-system 'ascii-compatible-p)))
-	(setq default-file-name-coding-system coding-system)))
+  (if (and default-enable-multibyte-characters
+	   (or (not coding-system)
+	       (coding-system-get coding-system 'ascii-compatible-p)))
+      (setq default-file-name-coding-system coding-system))
   (setq default-terminal-coding-system coding-system)
   (setq default-keyboard-coding-system coding-system)
   ;; Preserve eol-type from existing default-process-coding-systems.





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-19  0:23     ` bug#4157: " Kenichi Handa
@ 2009-08-19 22:47       ` Peter Dyballa
  2009-08-24 11:30       ` Peter Dyballa
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Dyballa @ 2009-08-19 22:47 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


Am 19.08.2009 um 02:23 schrieb Kenichi Handa:

> the attached patch should solve the problem


It will take some time until I can actually test the patch. Changes
in the configure script lead to -I/usr/X11R6/include coming so early
that compilation with libfreetype fails:

xftfont.c: In function ‘xftfont_open’:
xftfont.c:220: error: ‘FC_WIDTH’ undeclared (first use in this function)
xftfont.c:220: error: (Each undeclared identifier is reported only once
xftfont.c:220: error: for each function it appears in.)
xftfont.c:259: error: ‘FC_HINT_STYLE’ undeclared (first use in this function)
make[2]: *** [xftfont.o] Error 1

I need to find a workaround, or compile without libfreetype.

--
Greetings

   Pete

UNIX is user friendly, it's just picky about who its friends are.








^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-16  2:19 bug#4157: 23.1.50; faulty character characterisation for ä Peter Dyballa
  2009-08-18  1:09 ` Kenichi Handa
@ 2009-08-22  4:09 ` Stefan Monnier
  2009-08-22  8:50   ` Peter Dyballa
  2019-10-09 14:29 ` Stefan Kangas
  2 siblings, 1 reply; 47+ messages in thread
From: Stefan Monnier @ 2009-08-22  4:09 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: 4157

> When I launch GNU Emacs in an ISO Latin environment (env
> LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 /usr/local/bin/
> emacs-23.1.50 -Q &) and display in dired a directory with entries  from some
> month of March the "Mär" abbreviation for the German month  name "März" is
> displayed as M\344r. C-u C-x = on this \344 reveals:

Hmm... that looks like a problem in dired: the file names in the output
of `ls' should follow file-name-coding-system, whereas the rest of the
output seems to use locale-coding-system.  Could you check if that's
indeed the case:
- create a file from the Finder using accented latin-1
  chars as well as non-latin-1 chars.
- look at it in your dired and tell us what you see.

On a Darwin system, I very warmly recommend sticking to utf-8 coding
systems for everything, since it should avoid such problems.


        Stefan





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-22  4:09 ` Stefan Monnier
@ 2009-08-22  8:50   ` Peter Dyballa
  2009-08-23  1:49     ` Stefan Monnier
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Dyballa @ 2009-08-22  8:50 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 4157


Am 22.08.2009 um 06:09 schrieb Stefan Monnier:

>> When I launch GNU Emacs in an ISO Latin environment (env
>> LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 /usr/local/bin/
>> emacs-23.1.50 -Q &) and display in dired a directory with entries   
>> from some
>> month of March the "Mär" abbreviation for the German month  name
>> "März" is
>> displayed as M\344r. C-u C-x = on this \344 reveals:
>
> Hmm... that looks like a problem in dired: the file names in the  
> output
> of `ls' should follow file-name-coding-system, whereas the rest of the
> output seems to use locale-coding-system.  Could you check if that's
> indeed the case:
> - create a file from the Finder using accented latin-1
>   chars, as well as non-latin-1 chars).
> - look at it in your dired and tell us what you see.

In both locales the *file names* are correct and also detected as
containing "composed characters". The problem is with the month in the
file's date. In the ISO Latin encoding the ä character is not
recognised as that entity and as part of the ISO Latin encoding, but as
something strange which can only be displayed in octal
representation. In UTF-8 this does not happen.

Isearch, for example, finds the \344 characters only as C-s C-q 3 4 4
RET, while the character ä cannot be found (because it is composed of a
and ¨, and therefore a search for a succeeds, in both locales).

>
> On a Darwin system, I very warmly recommend to stick to utf-8 coding
> systems for everything, since it should avoid such problems.
>

Yes, you're right. I was trying to work on a problem with a "Carbon  
Emacs" ...

--
Greetings

   Pete

One cannot live by television, video games, top ten CDs, and dumb  
movies alone.
				– Amiri Baraka, 1999








^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-22  8:50   ` Peter Dyballa
@ 2009-08-23  1:49     ` Stefan Monnier
  2009-08-23  9:57       ` Peter Dyballa
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Monnier @ 2009-08-23  1:49 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: 4157

>> Hmm... that looks like a problem in dired: the file names in the output
>> of `ls' should follow file-name-coding-system, whereas the rest of the
>> output seems to use locale-coding-system.  Could you check if that's
>> indeed the case:
>> - create a file from the Finder using accented latin-1
>> chars, as well as non-latin-1 chars).
>> - look at it in your dired and tell us what you see.
> In both locales the *file names* are correct and also detected as containing

"correct" doesn't really tell me what you see, but I see what you mean.

> "composed characters," it's a problem with the file's  month date. In the

So my guess was right: ls's output uses utf-8 for the filenames, but
latin-1 for the date, which is why it's difficult for dired to do the
right thing (it's not impossible, of course, but it's more work and
dired is currently not set up for that).


        Stefan





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-23  1:49     ` Stefan Monnier
@ 2009-08-23  9:57       ` Peter Dyballa
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Dyballa @ 2009-08-23  9:57 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 4157


Am 23.08.2009 um 03:49 schrieb Stefan Monnier:

>> In both locales the *file names* are correct and also detected as  
>> containing
>
> "correct" doesn't really tell me what you see, but I see what you  
> mean.

"Correct" meant that I was seeing what I had typed before in Finder...

>
>> "composed characters," it's a problem with the file's  month date.  
>> In the
>
> So my guess was right: ls's output uses utf-8 for the filenames, but
> latin-1 for the date, which is why it's difficult for dired to do the
> right thing (it's not impossible, of course, but it's more work and
> dired is currently not setup for that).
>

Here is a little test from a shell (actually *shell* buffer in NS  
Emacs.app with UTF-8 locales):

pete 252 /\ gls -lN zo*
-rw-r--r-- 1 pete admin 281829 20. Mär 1998  zoä€.au
pete 253 /\ ls -lw zo*
-rw-r--r--   1 pete  admin  281829 20 Mär  1998 zoä€.au
pete 254 /\ gls -lN zo* | od -j 32 -t a
0000040    0   .  sp   M   \303   \244   r  sp   1   9   9   8  sp  sp   z   o
0000060    a   \314  88   \342  82   \254   .   a   u  nl
0000072
pete 255 /\ env LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 gls -lN zo* | od -j 32 -t a
0000040    0   .  sp   M   \344   r  sp   1   9   9   8  sp  sp   z   o   a
0000060    \314  88   \342  82   \254   .   a   u  nl
0000071
pete 256 /\ ls -lw zo* | od -j 32 -t a
0000040    2   9  sp   2   0  sp   M   \303   \244   r  sp   1   9   9   8
0000060   sp   z   o   a   \314  88   \342  82   \254   .   a   u  nl
0000075
pete 257 /\ env LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 ls -lw zo* | od -j 32 -t a
0000040    2   9  sp   2   0  sp   M   \344   r  sp  sp   1   9   9   8  sp
0000060    z   o   a   \314  88   \342  82   \254   .   a   u  nl
0000074

So the *ls commands deliver the month name precomposed in their
locale's encoding, while the file name is always *de*composed UTF-8:

\303 \244    = C3 A4    = LATIN SMALL LETTER A WITH DIAERESIS ä at U+00E4
\314 88      = CC 88    = COMBINING DIAERESIS                 ¨ at U+0308
\342 82 \254 = E2 82 AC = EURO SIGN                           € at U+20AC
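
The byte values in this table can be cross-checked with a short Python
sketch (Python is used only for convenience; the helper name utf8_hex
is hypothetical and not part of any tool mentioned in this thread):

```python
import unicodedata


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as spaced upper-case hex."""
    return ch.encode("utf-8").hex(" ").upper()


# The three characters from the table: precomposed ä, the combining
# diaeresis used in the decomposed file name, and the euro sign.
for ch in ("\u00e4", "\u0308", "\u20ac"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {utf8_hex(ch)}")

# The decomposed file name from the od dumps, rebuilt from code points.
file_name_bytes = "zoa\u0308\u20ac.au".encode("utf-8")
```

Rebuilding the file name from its code points gives the same byte
sequence that od shows after "z o": a, CC 88, E2 82 AC, then ".au".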

--
Greetings

   Pete

Bake pizza not war!








^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-19  0:23     ` bug#4157: " Kenichi Handa
  2009-08-19 22:47       ` Peter Dyballa
@ 2009-08-24 11:30       ` Peter Dyballa
  2009-08-24 12:22         ` bug#4157: " Kenichi Handa
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Dyballa @ 2009-08-24 11:30 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


Am 19.08.2009 um 02:23 schrieb Kenichi Handa:

> Ah, I found this code in mule-cmds.el.
>
>   (if (eq system-type 'darwin)
>       ;; The file-name coding system on Darwin systems is always utf-8.
>       (setq default-file-name-coding-system 'utf-8)
>
> I don't remember why that code exists.  If the comment is
> wrong (i.e. there's no need of treating darwin specially
> here), the attached patch should solve the problem.

I finally managed to build a stable GNU Emacs! In the ISO Latin-9/ISO
8859-15 environment default-file-name-coding-system is utf-8 and
file-name-coding-system is nil, local in each of the visited dired buffers
(0 in mode-lines). So again I see the file names (almost) correctly
(the composed characters are taken, as usual, from some arbitrary
fonts) and the month date field as M\344r instead of Mär. The \344
character (4194276, #o17777744, #x3fffe4), although part of ISO
8859-15, is taken to be a raw byte and faultily declared as "not
encodable by coding system iso-latin-9-unix." In the variant launched
with UTF-8 this date field is displayed as Mär, and this *obviously
composed* ä is described correctly as ä (228, #o344, #xe4) and displayed
from an iso10646-1 encoded font. The buffer (and file)
code is described as UTF-8 C3A4: #xC3 #xA4 (encoded by coding system
utf-8-unix).

--
Greetings

   Pete

We also sponsor National Invisible Chronic Illness Awareness Week  
annually in September.
Join the millions








^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: Re: bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-24 11:30       ` Peter Dyballa
@ 2009-08-24 12:22         ` Kenichi Handa
  2009-08-24 15:21           ` Peter Dyballa
                             ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Kenichi Handa @ 2009-08-24 12:22 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: 4157

In article <56EC0D72-D541-470F-9FAB-2F766BD45601@Freenet.DE>, Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:

> I finally managed to build a stable GNU Emacs! In ISO Latin-9/ISO
> 8859-15 environment default-file-name-coding-system is utf-8 and
> file-name-coding-system is nil, local in each of the visited dired buffers
> (0 in mode-lines).

Ok, so dired is going to decode the output of ls by utf-8.

> So again I see the file names (almost) correctly  
> (the composed characters are taken, as usual, from some arbitrary  
> fonts) and the month date field as M\344r instead of Mär and the \344  
> character (4194276, #o17777744, #x3fffe4), although part of ISO  
> 8859-15, is supposed to be a raw byte and faultily declared as "not  
> encodable by coding system iso-latin-9-unix."

No, Emacs just tries to encode \344 by utf-8 and correctly
declared that it is not encodable by utf-8.

In article <jwvfxbjb8t1.fsf-monnier+emacsbugreports@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> So my guess was right: ls's output uses utf-8 for the filenames, but
> latin-1 for the date...

I think that is your case (latin-9 instead of latin-1).

Stefan also wrote:

> which is why it's difficult for dired to do the
> right thing (it's not impossible, of course, but it's more work and
> dired is currently not setup for that).

How about making dired decode the filename part by
file-name-coding-system and the rest by
default-process-coding-system?

By the way,

> So again I see the file names (almost) correctly  
> (the composed characters are taken, as usual, from some arbitrary  
> fonts)

Please try to load ucs-normalize and set
file-name-coding-system to utf-8-hfs.  You should see file
names correctly by precomposed characters as "ä".

---
Kenichi Handa
handa@m17n.org





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-24 12:22         ` bug#4157: " Kenichi Handa
@ 2009-08-24 15:21           ` Peter Dyballa
  2009-08-25  0:46             ` bug#4157: " Kenichi Handa
  2009-08-25 22:19           ` Peter Dyballa
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Peter Dyballa @ 2009-08-24 15:21 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


Am 24.08.2009 um 14:22 schrieb Kenichi Handa:

>> So again I see the file names (almost) correctly
>> (the composed characters are taken, as usual, from some arbitrary
>> fonts)
>
> Please try to load ucs-normalize and set
> file-name-coding-system to utf-8-hfs.  You should see file
> names correctly by precomposed characters as "ä".


Even without this new file I could see composed characters.

	(require 'ucs-normalize)
	ucs-normalize

or

	(load-library "ucs-normalize")
	t

in the new and still not installed GNU Emacs 23.1.50 makes no
difference. The version from three weeks ago does not allow separating
the composed character's components, i.e., the text cursor
cannot select this or that component: one step and it has reached the
previous or next character. A difference I can see comes from C-u C-x =:
the line

   canonical-combining-class: 0 (Spacing, split, enclosing,  
reordrant, and Tibetan subjoined)

is removed from the output.


The composed character still is not taken from the default font.  
Could be one component is missing, ¨ – but it has the precomposed  
characters I usually use.

--
Greetings

   Pete

Treffen sich zwei Funktionen.
Sagt die eine: „Verschwinde oder ich differenzier' dich!“
Erwidert die andere: „Ätsch, ich bin exponentiell!“






^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: Re: bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-24 15:21           ` Peter Dyballa
@ 2009-08-25  0:46             ` Kenichi Handa
  2009-08-25  7:51               ` Peter Dyballa
  0 siblings, 1 reply; 47+ messages in thread
From: Kenichi Handa @ 2009-08-25  0:46 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: 4157

In article <AB00A4BE-E1DF-45F4-8953-23B2F7E551F9@Freenet.DE>, Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:

> The version from three weeks ago does not allow to  
> separate the composed character's components, i.e., the text cursor  
> cannot select this or that component, one step and it has reached the  
> previous or next character.

Even if your buffer has the char sequence "a" and "¨", "¨"
is displayed on top of "a" and cursor movement treats those
two characters atomically as if there's a single character
"ä".

> A difference I can see comes C-u C-x =: the line

>    canonical-combining-class: 0 (Spacing, split, enclosing,  
> reordrant, and Tibetan subjoined)

> is removed from the output.

That's the point.  utf-8-hfs converts the sequence "a" and
"¨" to a single character "ä" on decoding, and breaks down
"ä" to "a" and "¨" on encoding.

> The composed character still is not taken from the default font.  
> Could be one component is missing, ¨ – but it has the precomposed  
> characters I usually use.

Which font is selected for the character "ä"?  It seems to
be a bug if the font is not your default font.

---
Kenichi Handa
handa@m17n.org





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-25  0:46             ` bug#4157: " Kenichi Handa
@ 2009-08-25  7:51               ` Peter Dyballa
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Dyballa @ 2009-08-25  7:51 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


Am 25.08.2009 um 02:46 schrieb Kenichi Handa:

>> The composed character still is not taken from the default font.
>> Could be one component is missing, ¨ – but it has the precomposed
>> characters I usually use.
>
> Which font is selected for the character "ä"?  It seems to
> be a bug if the font is not your default font.

   x:-mutt-clearlyu-medium-r-normal--17-120-100-100-p-123-iso10646-1

My default font is -b&h-lucidatypewriter-medium-r-normal- 
sans-10-100-75-75-m-60-iso10646-1.


I'll update GNU Emacs and study the NEWS.

--
Greetings

   Pete

Either this man is dead or my watch has stopped.
				- Groucho Marx








^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-24 12:22         ` bug#4157: " Kenichi Handa
  2009-08-24 15:21           ` Peter Dyballa
@ 2009-08-25 22:19           ` Peter Dyballa
  2009-08-27  6:52             ` bug#4157: " Kenichi Handa
  2009-08-28 19:27           ` Peter Dyballa
  2009-08-31 21:11           ` Peter Dyballa
  3 siblings, 1 reply; 47+ messages in thread
From: Peter Dyballa @ 2009-08-25 22:19 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


Am 24.08.2009 um 14:22 schrieb Kenichi Handa:

> Please try to load ucs-normalize and set
> file-name-coding-system to utf-8-hfs.  You should see file
> names correctly by precomposed characters as "ä".


When I copy a line from a dired buffer with a composed character  
taken from another font into *scratch* buffer and then apply on the  
marked file name ucs-normalize-HFS-NFC-region, then the foreign glyph  
is changed to one from the default font. Automatically this does not  
happen in dired buffer, although global-auto-composition-mode and  
auto-composition-mode are both t.

And ucs-normalize is auto-loaded!

--
Greetings
                                  <]
   Pete       o        __o         |__    o           recumbo
     ___o    /I       -\<,         |o \  -\),-%       ergo sum!
___/\ /\___./ \___...O/ O____.....`-O-'-()--o_________________








^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: Re: bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-25 22:19           ` Peter Dyballa
@ 2009-08-27  6:52             ` Kenichi Handa
  2009-08-27  8:50               ` Peter Dyballa
  0 siblings, 1 reply; 47+ messages in thread
From: Kenichi Handa @ 2009-08-27  6:52 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: 4157

In article <BB0E3F43-9DA7-42FB-B4F4-63AABC2719DF@Freenet.DE>, Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:

> > Please try to load ucs-normalize and set
> > file-name-coding-system to utf-8-hfs.  You should see file
> > names correctly by precomposed characters as "ä".

> When I copy a line from a dired buffer with a composed character  
> taken from another font into *scratch* buffer and then apply on the  
> marked file name ucs-normalize-HFS-NFC-region, then the foreign glyph  
> is changed to one from the default font. Automatically this does not  
> happen in dired buffer, although global-auto-composition-mode and  
> auto-composition-mode are both t.

Strange.  On GNU/Linux, I set file-name-coding-system to
utf-8-hfs, create a new file "ä" by:
  ESC : (write-region "test" nil "ä") RET
I confirmed that the file name is surely the two char
sequence of "a" and "̈" by another emacs.

Then M-x dired shows that file name by precomposed character
"ä" (LATIN SMALL LETTER A WITH DIAERESIS).

That means utf-8-hfs does convert the sequence "a" and "̈"
to/from "ä".  I have no idea why it doesn't work in your
environment.
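
The conversion described here can be approximated in Python with the
standard Unicode normalization forms. This is only a sketch: plain NFC
and NFD are used as stand-ins, and the HFS-specific variants that
ucs-normalize provides may differ from them for some characters.

```python
import unicodedata

# File name as a two-character sequence: "a" followed by
# COMBINING DIAERESIS (U+0308).
decomposed = "a\u0308"

# Direction used when decoding: compose to the single character
# LATIN SMALL LETTER A WITH DIAERESIS (U+00E4).
precomposed = unicodedata.normalize("NFC", decomposed)

# Direction used when encoding: break the character down again.
round_trip = unicodedata.normalize("NFD", precomposed)

print(len(decomposed), len(precomposed), round_trip == decomposed)
```

A search for the precomposed character only matches after the
composing step, which is consistent with the isearch behaviour reported
earlier in this thread.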

> And ucs-normalize is auto-loaded!

When is it loaded?

---
Kenichi Handa
handa@m17n.org





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: Re: bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-27  6:52             ` bug#4157: " Kenichi Handa
@ 2009-08-27  8:50               ` Peter Dyballa
  2009-08-27 11:33                 ` bug#4157: " Kenichi Handa
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Dyballa @ 2009-08-27  8:50 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


Am 27.08.2009 um 08:52 schrieb Kenichi Handa:

> I have no idea why it doesn't work in your environment.
>
>> And ucs-normalize is auto-loaded!
>
> When is it loaded?
>


Actually never! There are just precautions taken. When I deliberately
set file-name-coding-system to utf-8-hfs in my init file, GNU Emacs
stopped initialising with an error message about an undefined
encoding. So ucs-normalize obviously was not loaded. I added a
(require 'ucs-normalize) – and it works! It works exceptionally well:
i-search for ä or æ or ø in file names works.

That's really good work!


What I wonder is why so many different font encodings are used when
characters are described. Wouldn't it make sense to use an iso10646-1
encoding in a UTF-8 environment for characters from 8-bit ISO
encodings? Wouldn't it free resources when fewer fonts are used? Or is
it my fault that I include definitions for ISO encodings in my font set?

--
Greetings

   Pete

Government is actually the worst failure of civilized man. There has  
never been a really good one, and even those that are most tolerable  
are arbitrary, cruel, grasping and unintelligent.
				– H. L. Mencken








^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: Re: Re: bug#4157: 23.1.50;  faulty character characterisation for ä
  2009-08-27  8:50               ` Peter Dyballa
@ 2009-08-27 11:33                 ` Kenichi Handa
  2009-08-27 12:38                   ` Peter Dyballa
  0 siblings, 1 reply; 47+ messages in thread
From: Kenichi Handa @ 2009-08-27 11:33 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: 4157

In article <AA0844EE-B5CA-4622-A457-70944BBCD2A7@Freenet.DE>, Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:

>>> And ucs-normalize is auto-loaded!
> >
> > When is it loaded?

> Actually never! There are just precautions taken. When I deliberately  
> set file-name-coding-system to utf-8-hfs in my init file, GNU Emacs  
> stopped to initialise with an error message about an undefined  
> encoding. So ucs-normalize obviously was not loaded. I added a  
> (require 'ucs-normalize) – and it works! It works exceptionally well:  
> i-search for ä or æ or ø in file names works.

> That's really good work!

That's good.  Perhaps we should add an autoload cookie to
utf-8-hfs.

> What I wonder is why so many different font encodings are used when  
> characters are described. Wouldn't it make sense to use an iso10646-1  
> encoding in an UTF-8 environment for characters from 8-bit ISO  
> encodings? Wouldn't it free resources when less fonts are used? Or is  
> it my fault that I include definitions for ISO encodings in my font set?

I don't understand what you are asking.  Please explain more
precisely what these mean:
  o font encoding
  o characters from 8-bit ISO encodings
  o include definitions for ISO encodings in my font set?

---
Kenichi Handa
handa@m17n.org





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: Re: Re: bug#4157: 23.1.50;  faulty character characterisation for ä
  2009-08-27 11:33                 ` bug#4157: " Kenichi Handa
@ 2009-08-27 12:38                   ` Peter Dyballa
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Dyballa @ 2009-08-27 12:38 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


On 27.08.2009 at 13:33, Kenichi Handa wrote:

>> What I wonder is why so many different font encodings are used when
>> characters are described. Wouldn't it make sense to use an iso10646-1
>> encoding in an UTF-8 environment for characters from 8-bit ISO
>> encodings? Wouldn't it free resources when less fonts are used? Or is
>> it my fault that I include definitions for ISO encodings in my  
>> font set?
>
> I don't understand what you are saying.  Please tell more
> precisely what these mean:
>   o font encoding

     x:-b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso8859-1 (#x20)
     x:-b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso8859-15 (#xE6)
     x:-b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1 (#x20AC)


>   o characters from 8-bit ISO encodings

For example SPC or æ in the examples above.

>   o include definitions for ISO encodings in my font set?

     (create-fontset-from-fontset-spec "-b&h-lucidatypewriter-medium-r-*-*-10-*-*-*-*-*-fontset-10pt_lucida_sans_typewriter" t 'noerror)
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'latin-iso8859-1         '("lucidatypewriter" . "iso8859-1"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'latin-iso8859-2         '("lucidatypewriter" . "iso8859-2"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'latin-iso8859-3         '("lucidatypewriter" . "iso8859-3"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'latin-iso8859-4         '("lucidatypewriter" . "iso8859-4"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'cyrillic-iso8859-5      '("lucidatypewriter" . "iso8859-5"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'hebrew-iso8859-8        '("lucidatypewriter" . "iso8859-8"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'latin-iso8859-9         '("lucidatypewriter" . "iso8859-9"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'latin-iso8859-14        '("lucidatypewriter" . "iso8859-14"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'latin-iso8859-15        '("lucidatypewriter" . "iso8859-15"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'thai-tis620             '("lucidatypewriter" . "iso10646-1"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'mule-unicode-0100-24ff  '("code2000" . "iso10646-1"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'mule-unicode-2500-33ff  '("code2000" . "iso10646-1"))
     (set-fontset-font "fontset-10pt_lucida_sans_typewriter" 'mule-unicode-e000-ffff  '("code2000" . "iso10646-1"))

--
With peaceful greetings

   Pete

Don't force it; get a larger hammer.
				– Anthony's Law of Force






^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-24 12:22         ` bug#4157: " Kenichi Handa
  2009-08-24 15:21           ` Peter Dyballa
  2009-08-25 22:19           ` Peter Dyballa
@ 2009-08-28 19:27           ` Peter Dyballa
  2009-08-31 21:11           ` Peter Dyballa
  3 siblings, 0 replies; 47+ messages in thread
From: Peter Dyballa @ 2009-08-28 19:27 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


On 24.08.2009 at 14:22, Kenichi Handa wrote:

> Please try to load ucs-normalize and set
> file-name-coding-system to utf-8-hfs.  You should see file
> names correctly by precomposed characters as "«£".

I think utf-8-hfs should also be used for process output! An ls (or
gls) in a *shell* buffer still shows umlauts in the non-default font. And
presumably a shell-command that lists a directory's contents does, too...

--
Greetings

   Pete

When people run around and around in circles we say they are crazy.  
When planets do it we say they are orbiting.








^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-24 12:22         ` bug#4157: " Kenichi Handa
                             ` (2 preceding siblings ...)
  2009-08-28 19:27           ` Peter Dyballa
@ 2009-08-31 21:11           ` Peter Dyballa
  2009-09-01  0:04             ` Stefan Monnier
  3 siblings, 1 reply; 47+ messages in thread
From: Peter Dyballa @ 2009-08-31 21:11 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157


On 24.08.2009 at 14:22, Kenichi Handa wrote:

> Please try to load ucs-normalize and set
> file-name-coding-system to utf-8-hfs.


My test files were originally on an HFS+ and on an UFS (UNIX File  
System) volume (partition, slice, ...). This evening I copied them to  
an MS-DOS FAT16 file system. When I invoke GNU Emacs with -Q I see in  
all three file systems the decomposed characters in the file names.  
With ucs-normalize loaded and file-name-coding-system set to utf-8-hfs
they look OK in all three file systems. This makes the chosen name
utf-8-hfs not the best. Maybe utf-8-osx is more appropriate.


Anyway, I get good results with:

	(require 'ucs-normalize)
	(setq file-name-coding-system   'utf-8-hfs)
	(prefer-coding-system           'utf-8-hfs)

Isearch works, *Buffer List* contains the correct names – only the  
Buffers menu fails (except when in the Cocoa Emacs.app). Possibly  
this is best reported in another bug report.

--
Greetings

   Pete

Two functions meet.
One says: "Get lost, or I'll differentiate you!"
The other replies: "Ha, I'm exponential!"






^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-31 21:11           ` Peter Dyballa
@ 2009-09-01  0:04             ` Stefan Monnier
  2009-09-04  0:58               ` Kenichi Handa
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Monnier @ 2009-09-01  0:04 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: 4157

> loaded and file-name-coding-system set to utf-8-hfs
> they look OK in all three file systems. This makes the chosen name
> utf-8-hfs not the best.  Maybe utf-8-osx is more appropriate.

Good point.  Or maybe utf-8-darwin.


        Stefan





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-09-01  0:04             ` Stefan Monnier
@ 2009-09-04  0:58               ` Kenichi Handa
  0 siblings, 0 replies; 47+ messages in thread
From: Kenichi Handa @ 2009-09-04  0:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 4157, Peter_Dyballa, kawabata.taichi

In article <0B33C588-C7AD-41D9-8CAC-51AEBD40B264@Freenet.DE>, Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:

> My test files were originally on an HFS+ and on an UFS (UNIX File  
> System) volume (partition, slice, ...). This evening I copied them to  
> an MS-DOS FAT16 file system. When I invoke GNU Emacs with -Q I see in  
> all three file systems the decomposed characters in the file names.  
> With ucs-normalize loaded and file-name-coding-system set to utf-8-hfs
> they look OK in all three file systems. This makes the chosen name
> utf-8-hfs not the best. Maybe utf-8-osx is more appropriate.

In article <jwv8wgz4jkj.fsf-monnier+emacsbugreports@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Good point.  Or maybe utf-8-darwin.

Kawabata-san, what do you think?

---
Kenichi Handa
handa@m17n.org





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
@ 2009-09-04  5:51 川幡太一
  0 siblings, 0 replies; 47+ messages in thread
From: 川幡太一 @ 2009-09-04  5:51 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 4157, Peter_Dyballa

Hi,

I'm in favour of keeping the name of the coding system as 'utf-8-hfs',
as this coding system is defined by the specification of HFS+
(http://developer.apple.com/technotes/tn/tn1150.html) rather than by
MacOS itself.  This implies that if another OS mounts HFS, it should
still apply "modified NFD" to the file names.

Besides, the other components of MacOS handle UTF-8 as NFC, as seen
in Spotlight, etc.

It is very unfortunate (and possibly a flaw) that the Carbon API does not
take into account which file system it is accessing.  One has to take care
of this oneself when copying files between different file systems.  (For
example, when I back up files between file systems with "rsync", I usually
add an option such as "--iconv=UTF8-MAC,UTF-8")... sigh....
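
To illustrate the difference between the decomposed (NFD) and the
precomposed (NFC) form mentioned above, here is a minimal sketch using the
string functions from ucs-normalize.el.  The name
`demo-normalization-forms' is made up for this example, and it uses plain
NFD, not the "modified NFD" variant that HFS+ specifies:

```elisp
(require 'ucs-normalize)

(defun demo-normalization-forms (precomposed)
  "Compare PRECOMPOSED with its decomposed form (illustration only).
Return a list (LEN-NFC LEN-NFD SAME-AS-NFD SAME-AFTER-RECOMPOSING)."
  (let* ((decomposed (ucs-normalize-NFD-string precomposed))
         (recomposed (ucs-normalize-NFC-string decomposed)))
    (list (length precomposed)                ; characters in the NFC form
          (length decomposed)                 ; characters in the NFD form
          (string= precomposed decomposed)    ; the two forms differ as strings
          (string= precomposed recomposed)))) ; NFC restores the original

;; U+00E4 (a with diaeresis) decomposes into "a" followed by U+0308.
(demo-normalization-forms "\u00e4")   ; => (1 2 nil t)
```

A file name stored in decomposed form therefore does not compare equal to
the same name typed in precomposed form unless one of the two is
normalized first.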

Cheers,

2009/9/4 Kenichi Handa <handa@m17n.org>:
> In article <0B33C588-C7AD-41D9-8CAC-51AEBD40B264@Freenet.DE>, Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:
>
>> My test files were originally on an HFS+ and on an UFS (UNIX File
>> System) volume (partition, slice, ...). This evening I copied them to
>> an MS-DOS FAT16 file system. When I invoke GNU Emacs with -Q I see in
>> all three file systems the decomposed characters in the file names.
>> With ucs-normalize loaded and file-name-coding-system set to utf-8-hfs
>> they look OK in all three file systems. This makes the chosen name
>> utf-8-hfs not the best. Maybe utf-8-osx is more appropriate.
>
> In article <jwv8wgz4jkj.fsf-monnier+emacsbugreports@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>> Good point.  Or maybe utf-8-darwin.
>
> Kawabata-san, what do you think?
>
> ---
> Kenichi Handa
> handa@m17n.org
>



-- 
---------------------------------------------------------------------
 川幡 太一 (KAWABATA, Taichi)   E-mail: kawabata@clock.ocn.ne.jp
                  kawabata.taichi@gmail.com





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2009-08-16  2:19 bug#4157: 23.1.50; faulty character characterisation for ä Peter Dyballa
  2009-08-18  1:09 ` Kenichi Handa
  2009-08-22  4:09 ` Stefan Monnier
@ 2019-10-09 14:29 ` Stefan Kangas
  2019-10-09 18:48   ` Eli Zaretskii
  2019-10-09 19:47   ` Stefan Monnier
  2 siblings, 2 replies; 47+ messages in thread
From: Stefan Kangas @ 2019-10-09 14:29 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Peter Dyballa, 4157

found 4157 27.0.50
quit

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> Hmm... that looks like a problem in dired: the file names in the output
>>> of `ls' should follow file-name-coding-system, whereas the rest of the
>>> output seems to use locale-coding-system.  Could you check if that's
>>> indeed the case:
>>> - create a file from the Finder using accented latin-1
>>> chars, as well as non-latin-1 chars).
>>> - look at it in your dired and tell us what you see.
>> In both locales the *file names* are correct and also detected as containing
>
> "correct" doesn't really tell me what you see, but I see what you mean.
>
>> "composed characters," it's a problem with the file's  month date. In the
>
> So my guess was right: ls's output uses utf-8 for the filenames, but
> latin-1 for the date, which is why it's difficult for dired to do the
> right thing (it's not impossible, of course, but it's more work and
> dired is currently not setup for that).

Ten years later, I can verify that this is still an issue on current
master running on macOS 10.13.  I think Stefan Monnier is spot on
above.

Best regards,
Stefan Kangas





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-09 14:29 ` Stefan Kangas
@ 2019-10-09 18:48   ` Eli Zaretskii
  2019-10-09 19:47   ` Stefan Monnier
  1 sibling, 0 replies; 47+ messages in thread
From: Eli Zaretskii @ 2019-10-09 18:48 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Peter_Dyballa, monnier, 4157

> From: Stefan Kangas <stefan@marxist.se>
> Date: Wed, 9 Oct 2019 16:29:43 +0200
> Cc: Peter Dyballa <Peter_Dyballa@freenet.de>, 4157@debbugs.gnu.org
> 
> > "correct" doesn't really tell me what you see, but I see what you mean.
> >
> >> "composed characters," it's a problem with the file's  month date. In the
> >
> > So my guess was right: ls's output uses utf-8 for the filenames, but
> > latin-1 for the date, which is why it's difficult for dired to do the
> > right thing (it's not impossible, of course, but it's more work and
> > dired is currently not setup for that).
> 
> Ten years later, I can verify that this is still an issue on current
> master running on macOS 10.13.  I think Stefan Monnier is spot on
> above.

Maybe we should switch macOS to using ls-lisp.el?





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-09 14:29 ` Stefan Kangas
  2019-10-09 18:48   ` Eli Zaretskii
@ 2019-10-09 19:47   ` Stefan Monnier
  2019-10-09 22:42     ` Peter Dyballa
  2019-10-10  0:10     ` Stefan Kangas
  1 sibling, 2 replies; 47+ messages in thread
From: Stefan Monnier @ 2019-10-09 19:47 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Peter Dyballa, 4157

>> So my guess was right: ls's output uses utf-8 for the filenames, but
>> latin-1 for the date, which is why it's difficult for dired to do the
>> right thing (it's not impossible, of course, but it's more work and
>> dired is currently not setup for that).
>
> Ten years later, I can verify that this is still an issue on current
> master running on macOS 10.13.  I think Stefan Monnier is spot on
> above.

I understand why utf-8 is used for the filenames, but what makes the
month be output in latin-1?  macOS is supposedly an "all utf-8"
environment, AFAIK.

I'm not sure if macOS uses locales in the POSIX way, but... can you
check what is your locale set to (and ideally, maybe, check what/who
sets it)?


        Stefan






^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-09 19:47   ` Stefan Monnier
@ 2019-10-09 22:42     ` Peter Dyballa
  2019-11-11  1:49       ` Stefan Kangas
  2019-10-10  0:10     ` Stefan Kangas
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Dyballa @ 2019-10-09 22:42 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Stefan Kangas, 4157


> On 9.10.2019 at 21:47, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
> but... can you
> check what is your locale set to (and ideally, maybe, check what/who
> sets it)?

I am using three areas to set LANG and LC_CTYPE, each to the value of "de_DE.UTF-8." This is necessary because of macOS (and then Mac OS X). First it's ~/.MacOSX/environment.plist. This Property LIST file sets the two for the GUI login environment. Although all other processes should inherit from it I use constructs à la

	setenv LC_CTYPE	`defaults read ~/.MacOSX/environment LC_CTYPE`

or

	export LC_CTYPE=`defaults read "${HOME}/.MacOSX/environment" LC_CTYPE`

in ~/.cshrc and ~/.profile respectively (my login shell is tcsh 6.18.01 (Astron) 2012-02-14 (x86_64-apple-darwin) options wide,nls,dl,bye,al,kan,sm,rh,color,filec), and also in my ~/.xinitrc file so that X11 inherits these settings, too.

Dired is set up to use gls (now GNU coreutils 8.31). The system's /bin/ls neither understands -D nor --dired.

Performing a

	(shell-command "printenv | sort" nil "*stderr*")

in mini-buffer reports, among others:

	LANG=de_DE.UTF-8
	LC_CTYPE=de_DE.UTF-8


--
Greetings

  Pete

If builders built buildings the way programmers write programs, then the first woodpecker that came along would destroy civilization.
				– Weinberg's Second Law






^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-09 19:47   ` Stefan Monnier
  2019-10-09 22:42     ` Peter Dyballa
@ 2019-10-10  0:10     ` Stefan Kangas
  2019-10-10  7:20       ` Eli Zaretskii
                         ` (2 more replies)
  1 sibling, 3 replies; 47+ messages in thread
From: Stefan Kangas @ 2019-10-10  0:10 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Peter Dyballa, 4157

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> >> So my guess was right: ls's output uses utf-8 for the filenames, but
> >> latin-1 for the date, which is why it's difficult for dired to do the
> >> right thing (it's not impossible, of course, but it's more work and
> >> dired is currently not setup for that).
> >
> > Ten years later, I can verify that this is still an issue on current
> > master running on macOS 10.13.  I think Stefan Monnier is spot on
> > above.
>
> I understand why utf-8 is used for the filenames, but what makes the
> month be output in latin-1?  macOS is supposedly an "all utf-8"
> environment, AFAIK.
>
> I'm not sure if macOS uses locales in the POSIX way, but... can you
> check what is your locale set to (and ideally, maybe, check what/who
> sets it)?

I've never tried changing from UTF-8 myself, and use the default
English language macOS system setting.  My default environment is
simply:

$ env | grep ^L[CA]
LC_CTYPE=UTF-8

To see this, I was running:

LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 ./src/emacs -Q

When I replace "./src/emacs -Q" with "ls -l" in terminal, I get
strange characters for files with mtime in March.  (I tried this with
the default Terminal.app as well as another terminal emulator called
iterm2.)  The month name is "März" in German, but when it appears in the
date, the character "ä" shows up as "?".  Meanwhile, any filenames
with the same character display correctly, like so:

-rw-r--r--    1 skangas  staff      0 10 Okt 01:59 März
drwxr-xr-x    3 skangas  staff     96 10 M?r  2017 foobar

I see no problems displaying "ä" when I run:

LC_CTYPE=de_DE.UTF-8 LANG=de_DE.UTF-8 ./src/emacs -Q

Perhaps you're just not supposed to use anything but UTF-8 on macOS?
And this is just a configuration error?

Best regards,
Stefan Kangas





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10  0:10     ` Stefan Kangas
@ 2019-10-10  7:20       ` Eli Zaretskii
  2019-10-10 10:36         ` Stefan Kangas
  2019-10-10  8:15       ` Andreas Schwab
  2019-10-10 12:54       ` Stefan Monnier
  2 siblings, 1 reply; 47+ messages in thread
From: Eli Zaretskii @ 2019-10-10  7:20 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Peter_Dyballa, monnier, 4157

> Date: Thu, 10 Oct 2019 02:10:10 +0200
> Cc: Peter Dyballa <Peter_Dyballa@freenet.de>, 4157@debbugs.gnu.org
> 
> LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 ./src/emacs -Q
> 
> When I replace "./src/emacs -Q" with "ls -l" in terminal, I get
> strange characters for files with mtime in March.  (I tried this with
> the default Terminal.app as well as another terminal emulator called
> iterm2.)  The month name is "März" in German but when it's in the
> date, the character "ä" shows up as "?".  Meanwhile, any filenames
> with the same character displays correctly, like so:
> 
> -rw-r--r--    1 skangas  staff      0 10 Okt 01:59 März
> drwxr-xr-x    3 skangas  staff     96 10 M?r  2017 foobar

Please redirect the output to a file, and look at the file with
hexl-find-file.  What do you see in the place where there's a question
mark above?





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10  0:10     ` Stefan Kangas
  2019-10-10  7:20       ` Eli Zaretskii
@ 2019-10-10  8:15       ` Andreas Schwab
  2019-10-10 12:54       ` Stefan Monnier
  2 siblings, 0 replies; 47+ messages in thread
From: Andreas Schwab @ 2019-10-10  8:15 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Peter Dyballa, Stefan Monnier, 4157

On Oct 10 2019, Stefan Kangas <stefan@marxist.se> wrote:

> LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 ./src/emacs -Q
>
> When I replace "./src/emacs -Q" with "ls -l" in terminal, I get
> strange characters for files with mtime in March.  (I tried this with
> the default Terminal.app as well as another terminal emulator called
> iterm2.)  The month name is "März" in German but when it's in the
> date, the character "ä" shows up as "?".  Meanwhile, any filenames
> with the same character displays correctly, like so:
>
> -rw-r--r--    1 skangas  staff      0 10 Okt 01:59 März
> drwxr-xr-x    3 skangas  staff     96 10 M?r  2017 foobar

You are instructing ls to use Latin-9 for formatting the date, but it
doesn't do anything with the file names it gets from the system.  So
this is expected.  You will see exactly the same behaviour on Linux.
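
Seen from the Emacs side, the effect can be sketched as follows, assuming
ls emitted the single byte E4 for the "ä" of the abbreviated month name.
The helper name `demo-decode-second-char' is hypothetical:

```elisp
(defun demo-decode-second-char (bytes coding-system)
  "Decode unibyte BYTES with CODING-SYSTEM and return the second character."
  (aref (decode-coding-string bytes coding-system) 1))

;; "M", the byte E4, "r": the month abbreviation as Latin-9 bytes.
(let ((date-bytes (unibyte-string ?M #xe4 ?r)))
  (list
   ;; Decoded as Latin-9, byte E4 is U+00E4, the letter ä.
   (demo-decode-second-char date-bytes 'iso-latin-9)   ; => #xe4
   ;; Decoded as UTF-8, a lone E4 is invalid, so Emacs keeps it as a
   ;; raw byte in the `eight-bit' charset.
   (demo-decode-second-char date-bytes 'utf-8)))       ; => #x3fffe4
```

A listing whose date field and file names use different encodings cannot be
decoded correctly with one coding system: whichever is chosen, one of the
two parts ends up wrong.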

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10  7:20       ` Eli Zaretskii
@ 2019-10-10 10:36         ` Stefan Kangas
  2019-10-10 11:20           ` Eli Zaretskii
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Kangas @ 2019-10-10 10:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Peter Dyballa, Stefan Monnier, 4157

Eli Zaretskii <eliz@gnu.org> writes:

> > -rw-r--r--    1 skangas  staff      0 10 Okt 01:59 März
> > drwxr-xr-x    3 skangas  staff     96 10 M?r  2017 foobar
>
> Please redirect the output to a file, and look at the file with
> hexl-find-file.  What do you see in the place where there's a question
> mark above?

This file is from March 2018:

>00000e30: 6865 656c 2020 2020 3730 3420 3234 204d  heel    704 24 M
>00000e40: e472 2020 3230 3138 204f 6e42 6f61 7264  .r  2018 OnBoard
>00000e50: 696e 6742 756e 646c 6573 0a64 7277 7872  ingBundles.drwxr

Best regards,
Stefan Kangas





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 10:36         ` Stefan Kangas
@ 2019-10-10 11:20           ` Eli Zaretskii
  2019-10-10 11:52             ` Stefan Kangas
  2019-10-10 18:33             ` Peter Dyballa
  0 siblings, 2 replies; 47+ messages in thread
From: Eli Zaretskii @ 2019-10-10 11:20 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Peter_Dyballa, monnier, 4157

> From: Stefan Kangas <stefan@marxist.se>
> Date: Thu, 10 Oct 2019 12:36:29 +0200
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, Peter Dyballa <Peter_Dyballa@freenet.de>, 
> 	4157@debbugs.gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > > -rw-r--r--    1 skangas  staff      0 10 Okt 01:59 März
> > > drwxr-xr-x    3 skangas  staff     96 10 M?r  2017 foobar
> >
> > Please redirect the output to a file, and look at the file with
> > hexl-find-file.  What do you see in the place where there's a question
> > mark above?
> 
> This file is from March 2018:
> 
> >00000e30: 6865 656c 2020 2020 3730 3420 3234 204d  heel    704 24 M
> >00000e40: e472 2020 3230 3138 204f 6e42 6f61 7264  .r  2018 OnBoard
> >00000e50: 696e 6742 756e 646c 6573 0a64 7277 7872  ingBundles.drwxr

Like Andreas says, the dates are in Latin-9, but the file names are in
UTF-8 (probably utf-8-hfs).  Maybe we should on macOS override the
locale when we invoke 'ls'?

And I repeat my question: how about using ls-lisp.el on macOS by
default?





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 11:20           ` Eli Zaretskii
@ 2019-10-10 11:52             ` Stefan Kangas
  2019-10-10 12:39               ` Stefan Kangas
  2019-10-10 18:33             ` Peter Dyballa
  1 sibling, 1 reply; 47+ messages in thread
From: Stefan Kangas @ 2019-10-10 11:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Peter Dyballa, Stefan Monnier, 4157

Eli Zaretskii <eliz@gnu.org> writes:

> Like Andreas says, the dates are in Latin-9, but the file names are in
> UTF-8 (probably utf-8-hfs).  Maybe we should on macOS override the
> locale when we invoke 'ls'?

I like that idea.  Perhaps we could just make sure that the encoding
is always "UTF-8" while respecting the language.
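
A minimal sketch of that idea, keeping the language and territory of LANG
and swapping only the codeset.  Both names below are hypothetical, and the
sketch ignores LC_ALL and LC_CTYPE, which take precedence over LANG when
they are set:

```elisp
(defun demo-locale-with-codeset (locale codeset)
  "Return LOCALE with its codeset part replaced by CODESET.
LOCALE has the form LANGUAGE[_TERRITORY][.CODESET][@MODIFIER]; a
LOCALE without a codeset part, such as \"C\", is returned unchanged."
  (if (string-match "\\.\\([^.@]+\\)" locale)
      ;; Replace only the codeset (group 1), keeping the dot and modifier.
      (replace-match codeset t t locale 1)
    locale))

(defmacro demo-with-lang-codeset (codeset &rest body)
  "Run BODY with LANG rewritten to use CODESET for child processes.
An unset LANG is treated as \"C\"."
  (declare (indent 1))
  ;; Let-binding `process-environment' undoes the change automatically
  ;; when BODY exits; the first entry for a variable wins.
  `(let ((process-environment
          (cons (concat "LANG="
                        (demo-locale-with-codeset (or (getenv "LANG") "C")
                                                  ,codeset))
                process-environment)))
     ,@body))

(demo-locale-with-codeset "de_DE.ISO8859-15" "UTF-8")        ; => "de_DE.UTF-8"
(demo-locale-with-codeset "de_DE.ISO8859-15@euro" "UTF-8")   ; => "de_DE.UTF-8@euro"
```

A call to ls could then be wrapped as
(demo-with-lang-codeset "UTF-8" (process-file "ls" nil t nil "-l")).
Binding `process-environment' instead of calling `setenv' leaves the
environment of Emacs itself untouched.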

> And I repeat my question: how about using ls-lisp.el on macOS by
> default?

IMHO, it would be better to improve the support for the BSD-derivative
'ls' versions.

Best regards,
Stefan Kangas





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 11:52             ` Stefan Kangas
@ 2019-10-10 12:39               ` Stefan Kangas
  2019-10-10 12:41                 ` Stefan Kangas
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Kangas @ 2019-10-10 12:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Peter Dyballa, Stefan Monnier, 4157

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

Stefan Kangas <stefan@marxist.se> writes:

> > Like Andreas says, the dates are in Latin-9, but the file names are in
> > UTF-8 (probably utf-8-hfs).  Maybe we should on macOS override the
> > locale when we invoke 'ls'?
>
> I like that idea.  Perhaps we could just make sure that the encoding
> is always "UTF-8" while respecting the language.

How about something like the attached diff?  Would that be a
reasonable approach?

Best regards,
Stefan Kangas

[-- Attachment #2: setlang.diff --]
[-- Type: application/octet-stream, Size: 1677 bytes --]

diff --git a/lisp/dired.el b/lisp/dired.el
index 6e48d28b4c..dec4473396 100644
--- a/lisp/dired.el
+++ b/lisp/dired.el
@@ -1280,6 +1280,22 @@ dired-switches-recursive-p
   "Return non-nil if the string SWITCHES contains -R or --recursive."
   (dired-check-switches switches "R" "recursive"))
 
+(defmacro dired--with-encoding (encoding remote &rest body)
+  "Temporarily set LANG environment variable to ENCODING and run BODY.
+If optional argument REMOTE is non-nil, just run BODY."
+  (declare (indent 2))
+  `(if (not remote)
+       (let ((orig (getenv "LANG"))
+             (new orig))
+         (unwind-protect
+             (progn
+               (while (string-match "+\\.\\([^.@]+\\)" new)
+                 (setq new (replace-match encoding nil nil new)))
+               (setenv "LANG" new)
+               ,@body)
+           (setenv "LANG" orig)))
+     ,@body))
+
 (defun dired-insert-directory (dir switches &optional file-list wildcard hdr)
   "Insert a directory listing of DIR, Dired style.
 Use SWITCHES to make the listings.
@@ -1330,8 +1346,9 @@ dired-insert-directory
                             (executable-find "sh")))
                     (switch (if remotep "-c" shell-command-switch)))
                (unless
-                   (zerop
-                    (process-file sh nil (current-buffer) nil switch script))
+                   (dired--with-encoding "UTF-8" remotep
+                     (zerop
+                      (process-file sh nil (current-buffer) nil switch script)))
                  (user-error
                   "%s: No files matching wildcard" (cdr dir-wildcard)))
                (insert-directory-clean (point) switches)))

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 12:39               ` Stefan Kangas
@ 2019-10-10 12:41                 ` Stefan Kangas
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Kangas @ 2019-10-10 12:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Peter Dyballa, Stefan Monnier, 4157

[-- Attachment #1: Type: text/plain, Size: 189 bytes --]

Stefan Kangas <stefan@marxist.se> writes:

> How about something like the attached diff?  Would that be a
> reasonable approach?

Fixed a typo in the attached.

Best regards,
Stefan Kangas

[-- Attachment #2: setlang2.diff --]
[-- Type: application/octet-stream, Size: 1658 bytes --]

diff --git a/lisp/dired.el b/lisp/dired.el
index 6e48d28b4c..f7c8f853b9 100644
--- a/lisp/dired.el
+++ b/lisp/dired.el
@@ -1280,6 +1280,22 @@ dired-switches-recursive-p
   "Return non-nil if the string SWITCHES contains -R or --recursive."
   (dired-check-switches switches "R" "recursive"))
 
+(defmacro dired--with-encoding (encoding remote &rest body)
+  "Temporarily set LANG environment variable to ENCODING and run BODY.
+If REMOTE is non-nil, just run BODY."
+  (declare (indent 2))
+  `(if (not remote)
+       (let ((orig (getenv "LANG"))
+             (new orig))
+         (unwind-protect
+             (progn
+               (while (string-match "\\.\\([^.@]+\\)" new)
+                 (setq new (replace-match encoding nil nil new)))
+               (setenv "LANG" new)
+               ,@body)
+           (setenv "LANG" orig)))
+     ,@body))
+
 (defun dired-insert-directory (dir switches &optional file-list wildcard hdr)
   "Insert a directory listing of DIR, Dired style.
 Use SWITCHES to make the listings.
@@ -1330,8 +1346,9 @@ dired-insert-directory
                             (executable-find "sh")))
                     (switch (if remotep "-c" shell-command-switch)))
                (unless
-                   (zerop
-                    (process-file sh nil (current-buffer) nil switch script))
+                   (dired--with-encoding "UTF-8" remotep
+                     (zerop
+                      (process-file sh nil (current-buffer) nil switch script)))
                  (user-error
                   "%s: No files matching wildcard" (cdr dir-wildcard)))
                (insert-directory-clean (point) switches)))

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10  0:10     ` Stefan Kangas
  2019-10-10  7:20       ` Eli Zaretskii
  2019-10-10  8:15       ` Andreas Schwab
@ 2019-10-10 12:54       ` Stefan Monnier
  2019-10-10 13:12         ` Stefan Kangas
  2 siblings, 1 reply; 47+ messages in thread
From: Stefan Monnier @ 2019-10-10 12:54 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Peter Dyballa, 4157

> To see this, I was running:
>
> LC_CTYPE=de_DE.ISO8859-15 LANG=de_DE.ISO8859-15 ./src/emacs -Q

Definitely not a good idea.

> Perhaps you're just not supposed to use anything but UTF-8 on macOS?

Exactly.

I'd argue this applies to GNU/Linux as well nowadays, but Mac OS X has
been utf-8 only since the very beginning, AFAIK, and it (for example)
enforces that file names are in utf-8, so using anything other than utf-8
on Mac OS X is asking for trouble.


        Stefan






^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 12:54       ` Stefan Monnier
@ 2019-10-10 13:12         ` Stefan Kangas
  2019-11-17 20:58           ` Stefan Kangas
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Kangas @ 2019-10-10 13:12 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Peter Dyballa, 4157

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > Perhaps you're just not supposed to use anything but UTF-8 on macOS?
>
> Exactly.
>
> I'd argue this applies to GNU/Linux as well nowadays, but Mac OS X has
> been utf-8 only since the very beginning, AFAIK, and it (for example)
> enforces that file names are in utf-8, so using anything other than utf-8
> on Mac OS X is asking for trouble.

FWIW, I would have nothing against closing this as wontfix and writing
it off as a case of "don't do that then".

Best regards,
Stefan Kangas





^ permalink raw reply	[flat|nested] 47+ messages in thread

* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 11:20           ` Eli Zaretskii
  2019-10-10 11:52             ` Stefan Kangas
@ 2019-10-10 18:33             ` Peter Dyballa
  2019-10-10 18:57               ` Eli Zaretskii
  2019-10-11  7:10               ` Andreas Schwab
  1 sibling, 2 replies; 47+ messages in thread
From: Peter Dyballa @ 2019-10-10 18:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Kangas, monnier, 4157


> On 10.10.2019 at 13:20, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> Like Andreas says, the dates are in Latin-9, but the file names are in
> UTF-8 (probably utf-8-hfs).

Alright, file names in Mac OS X were recorded in a special form of UTF-8, accented characters as two characters. GNU ls outputs the month name correctly, but not the file name, which is still held in UTF-8 and not converted to ISO Latin-9. So it's more of a GNU ls bug.
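
For illustration, here is a minimal Python sketch of the "two characters" point. It assumes the special form in question behaves like Unicode NFD decomposition, which is close to (but not necessarily identical with) what the Mac file system stores; the variable names are made up for the example.

```python
import unicodedata

# "März" written with the precomposed character ä (U+00E4).
name = "M\u00e4rz"

# NFD splits ä into "a" followed by U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", name)

# Code points and UTF-8 bytes of the decomposed name.
codepoints = [f"U+{ord(c):04X}" for c in decomposed]
utf8_bytes = decomposed.encode("utf-8")

print(codepoints)
print(utf8_bytes)
```

The four-character name becomes five code points, and the single Latin-9 byte 0xE4 has no direct counterpart in the stored byte sequence.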

--
Greetings

  Pete

The next generation of interesting software will be done on the Macintosh, not the IBM PC.
				– Bill Gates, Nov 1984










* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 18:33             ` Peter Dyballa
@ 2019-10-10 18:57               ` Eli Zaretskii
  2019-10-10 21:07                 ` Stefan Monnier
  2019-10-11  7:10               ` Andreas Schwab
  1 sibling, 1 reply; 47+ messages in thread
From: Eli Zaretskii @ 2019-10-10 18:57 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: stefan, monnier, 4157

> From: Peter Dyballa <Peter_Dyballa@Freenet.DE>
> Date: Thu, 10 Oct 2019 20:33:56 +0200
> Cc: Stefan Kangas <stefan@marxist.se>,
>  monnier@iro.umontreal.ca,
>  4157@debbugs.gnu.org
> 
> Alright, file names in Mac OS X were recorded in a special form of UTF-8, accented characters as two characters. GNU ls outputs the month name correctly, but not the file name, which is still held in UTF-8 and not converted to ISO Latin-9. So it's more of a GNU ls bug.

No, I don't think it is.  No version of 'ls' I know of, including GNU
'ls', recodes file names; they just emit the bytestream they find in
the directory.  The idea is that you create files and display them
under the same setting of the locale's codeset.  If you change the
codeset between the time you created the file and the time you display
it, you are toast.  AFAIK, this happens on any POSIX filesystem.
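
A minimal Python sketch of that mismatch, assuming nothing beyond the two codesets discussed in this thread (the variable names are illustrative only): the same bytes are read once under each codeset.

```python
# Bytes of "März" as a UTF-8 file name (precomposed ä, for brevity).
utf8_name = "M\u00e4rz".encode("utf-8")          # b"M\xc3\xa4rz"

# A Latin-9 reader maps every byte to exactly one character.
as_latin9 = utf8_name.decode("iso-8859-15")

# The reverse mismatch: Latin-9 "Mär" is not valid UTF-8 at all.
latin9_month = "M\u00e4r".encode("iso-8859-15")  # b"M\xe4r"
try:
    latin9_month.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(as_latin9), valid_utf8)
```

The second case is consistent with the raw byte \344 shown in the original report: a lone 0xE4 cannot be decoded as UTF-8, so it is left as an undecoded byte.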






* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 18:57               ` Eli Zaretskii
@ 2019-10-10 21:07                 ` Stefan Monnier
  2019-10-11 13:33                   ` Stefan Kangas
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Monnier @ 2019-10-10 21:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Peter Dyballa, stefan, 4157

>> Alright, file names in Mac OS X were recorded in a special form of UTF-8,
>> accented characters as two characters. GNU ls outputs the month name
>> correctly, but not the file name, which is still held in UTF-8 and not
>> converted to ISO Latin-9. So it's more of a GNU ls bug.
>
> No, I don't think it is.  No version of 'ls' I know of, including GNU
> 'ls', recodes file names; they just emit the bytestream they find in
> the directory.  The idea is that you create files and display them
> under the same setting of the locale's codeset.  If you change the
> codeset between the time you created the file and the time you display
> it, you are toast.  AFAIK, this happens on any POSIX filesystem.

There are different ways to look at the problem and attribute blame
(e.g. since macosx enforces file names to be utf-8 (contrary to POSIX),
`ls` in macosx *could* do the recoding of filenames reliably), but in
any case I think it's clear that it's not a bug in Emacs.
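
As a rough sketch of what such recoding might look like (this is not how any existing `ls` behaves; `recode_filename` is a hypothetical helper, and composing to NFC before encoding is an assumption about what a sensible implementation would do):

```python
import unicodedata

def recode_filename(raw: bytes, codeset: str = "iso-8859-15") -> bytes:
    """Hypothetical helper: recode a UTF-8 file name to the locale's codeset."""
    text = raw.decode("utf-8")                 # assumes the name is valid UTF-8
    text = unicodedata.normalize("NFC", text)  # compose "a" + U+0308 into "ä"
    # Characters the target codeset lacks become "?" instead of raising.
    return text.encode(codeset, errors="replace")

# A decomposed "März", as a Mac file system might store it.
stored = "Ma\u0308rz".encode("utf-8")
recoded = recode_filename(stored)
print(recoded)
```

The recoding is only reliable because the input encoding is known in advance; on a file system that accepts arbitrary bytes, the first `decode` step could fail.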


        Stefan







* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 18:33             ` Peter Dyballa
  2019-10-10 18:57               ` Eli Zaretskii
@ 2019-10-11  7:10               ` Andreas Schwab
  2019-10-11  7:23                 ` Peter Dyballa
  1 sibling, 1 reply; 47+ messages in thread
From: Andreas Schwab @ 2019-10-11  7:10 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: Stefan Kangas, monnier, 4157

On Oct 10 2019, Peter Dyballa <Peter_Dyballa@Freenet.DE> wrote:

>> On 10.10.2019 at 13:20, Eli Zaretskii <eliz@gnu.org> wrote:
>> 
>> Like Andreas says, the dates are in Latin-9, but the file names are in
>> UTF-8 (probably utf-8-hfs).
>
> Alright, file names in Mac OS X were recorded in a special form of UTF-8, accented characters as two characters. GNU ls outputs the month name correctly, but not the file name, which is still held in UTF-8 and not converted to ISO Latin-9. So it's more of a GNU ls bug.

MacOS doesn't use GNU ls.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."






* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-11  7:10               ` Andreas Schwab
@ 2019-10-11  7:23                 ` Peter Dyballa
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Dyballa @ 2019-10-11  7:23 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Stefan Kangas, monnier, 4157


> On 11.10.2019 at 09:10, Andreas Schwab <schwab@linux-m68k.org> wrote:
> 
> MacOS doesn't use GNU ls.

That's correct. It uses the ls from FreeBSD, which is why I am using GNU ls from coreutils.

--
With peaceful greetings

  Pete

In Germany there can be no revolution, because one would have to walk on the lawn to do it.
				- Joseph Stalin







* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 21:07                 ` Stefan Monnier
@ 2019-10-11 13:33                   ` Stefan Kangas
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Kangas @ 2019-10-11 13:33 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Peter Dyballa, 4157

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> There are different ways to look at the problem and attribute blame
> (e.g. since macosx enforces file names to be utf-8 (contrary to POSIX),
> `ls` in macosx *could* do the recoding of filenames reliably), but in
> any case I think it's clear that it's not a bug in Emacs.

So close this as notabug?  I thought a bit more about the workaround I
suggested above and I'm becoming less and less convinced that it's
worth the effort.

Best regards,
Stefan Kangas






* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-09 22:42     ` Peter Dyballa
@ 2019-11-11  1:49       ` Stefan Kangas
  2019-11-11 16:36         ` Peter Dyballa
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Kangas @ 2019-11-11  1:49 UTC (permalink / raw)
  To: Peter Dyballa; +Cc: Stefan Monnier, 4157

Peter Dyballa <Peter_Dyballa@Freenet.DE> writes:

>> On 9.10.2019 at 21:47, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> 
>> but... can you
>> check what is your locale set to (and ideally, maybe, check what/who
>> sets it)?
>
> I am using three areas to set LANG and LC_CTYPE, each to the value of
> "de_DE.UTF-8." This is necessary because of macOS (and then Mac OS X). First
> it's ~/.MacOSX/environment.plist. This Property LIST file sets the two for the
> GUI login environment. Although all other processes should inherit from it I use
> constructs à la
>
> 	setenv LC_CTYPE	`defaults read ~/.MacOSX/environment LC_CTYPE`
>
> or
>
> 	export LC_CTYPE=`defaults read "${HOME}/.MacOSX/environment" LC_CTYPE`
>
> in ~/.cshrc resp. ~/.profile (my login shell is tcsh 6.18.01 (Astron) 2012-02-14 (x86_64-apple-darwin) options wide,nls,dl,bye,al,kan,sm,rh,color,filec), and also in my ~/.xinitrc file that X11 also inherits these settings.
>
> Dired is set up to use gls (now GNU coreutils 8.31). The system's /bin/ls neither understands -D nor --dired.
>
> Performing a
>
> 	(shell-command "printenv | sort" nil "*stderr*")
>
> in mini-buffer reports, among others:
>
> 	LANG=de_DE.UTF-8
> 	LC_CTYPE=de_DE.UTF-8

Thanks.  Are you still able to reproduce the original issue here?

Best regards,
Stefan Kangas






* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-11-11  1:49       ` Stefan Kangas
@ 2019-11-11 16:36         ` Peter Dyballa
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Dyballa @ 2019-11-11 16:36 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Stefan Monnier, 4157


> On 11.11.2019 at 02:49, Stefan Kangas <stefan@marxist.se> wrote:
> 
> Thanks.  Are you still able to reproduce the original issue here?

No. In GNU Emacs 23.4 the environment variables are reset to one: LC_NUMERIC=C. So the localised "Mär" becomes a "Mar."

--
Greetings

  Pete

To most people solutions mean finding the answers. But to chemists solutions
are things that are still all mixed up.







* bug#4157: 23.1.50; faulty character characterisation for ä
  2019-10-10 13:12         ` Stefan Kangas
@ 2019-11-17 20:58           ` Stefan Kangas
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Kangas @ 2019-11-17 20:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Peter Dyballa, 4157

tags 4157 + notabug
close 4157
thanks

Stefan Kangas <stefan@marxist.se> writes:

> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>> > Perhaps you're just not supposed to use anything but UTF-8 on macOS?
>>
>> Exactly.
>>
>> I'd argue this applies to GNU/Linux as well nowadays, but macosx has
>> been utf-8 only since the very beginning, AFAIK, and it (for example)
>> enforces that file names are in utf-8, so using anything other than utf-8
>> in macosx is asking for trouble.
>
> FWIW, I'd have nothing against closing this as wontfix and writing
> this up as a case of "don't do that then".

No further comments here within 5 weeks.  From the above, the decision
here seems to be that we only support UTF-8 locales on macOS.  This is
the default, and there should be no reason to change it.

I'm therefore closing this as notabug.

Best regards,
Stefan Kangas






end of thread, other threads:[~2019-11-17 20:58 UTC | newest]

Thread overview: 47+ messages
2009-08-16  2:19 bug#4157: 23.1.50; faulty character characterisation for ä Peter Dyballa
2009-08-18  1:09 ` Kenichi Handa
2009-08-18 13:40   ` Peter Dyballa
2009-08-19  0:23     ` bug#4157: " Kenichi Handa
2009-08-19 22:47       ` Peter Dyballa
2009-08-24 11:30       ` Peter Dyballa
2009-08-24 12:22         ` bug#4157: " Kenichi Handa
2009-08-24 15:21           ` Peter Dyballa
2009-08-25  0:46             ` bug#4157: " Kenichi Handa
2009-08-25  7:51               ` Peter Dyballa
2009-08-25 22:19           ` Peter Dyballa
2009-08-27  6:52             ` bug#4157: " Kenichi Handa
2009-08-27  8:50               ` Peter Dyballa
2009-08-27 11:33                 ` bug#4157: " Kenichi Handa
2009-08-27 12:38                   ` Peter Dyballa
2009-08-28 19:27           ` Peter Dyballa
2009-08-31 21:11           ` Peter Dyballa
2009-09-01  0:04             ` Stefan Monnier
2009-09-04  0:58               ` Kenichi Handa
2009-08-22  4:09 ` Stefan Monnier
2009-08-22  8:50   ` Peter Dyballa
2009-08-23  1:49     ` Stefan Monnier
2009-08-23  9:57       ` Peter Dyballa
2019-10-09 14:29 ` Stefan Kangas
2019-10-09 18:48   ` Eli Zaretskii
2019-10-09 19:47   ` Stefan Monnier
2019-10-09 22:42     ` Peter Dyballa
2019-11-11  1:49       ` Stefan Kangas
2019-11-11 16:36         ` Peter Dyballa
2019-10-10  0:10     ` Stefan Kangas
2019-10-10  7:20       ` Eli Zaretskii
2019-10-10 10:36         ` Stefan Kangas
2019-10-10 11:20           ` Eli Zaretskii
2019-10-10 11:52             ` Stefan Kangas
2019-10-10 12:39               ` Stefan Kangas
2019-10-10 12:41                 ` Stefan Kangas
2019-10-10 18:33             ` Peter Dyballa
2019-10-10 18:57               ` Eli Zaretskii
2019-10-10 21:07                 ` Stefan Monnier
2019-10-11 13:33                   ` Stefan Kangas
2019-10-11  7:10               ` Andreas Schwab
2019-10-11  7:23                 ` Peter Dyballa
2019-10-10  8:15       ` Andreas Schwab
2019-10-10 12:54       ` Stefan Monnier
2019-10-10 13:12         ` Stefan Kangas
2019-11-17 20:58           ` Stefan Kangas
  -- strict thread matches above, loose matches on Subject: below --
2009-09-04  5:51 川幡太一
