Strange behaviour with dired and UTF8

unofficial mirror of emacs-devel@gnu.org 

* Strange behaviour with dired and UTF8
@ 2003-04-24 11:43 Jan D.
  2003-04-25 13:20 ` Kai Großjohann
  2003-05-01  6:52 ` Kenichi Handa
  0 siblings, 2 replies; 28+ messages in thread
From: Jan D. @ 2003-04-24 11:43 UTC (permalink / raw)


Hello.

Maybe I am doing this wrong, but here is what I try to do.
My language environment is ISO-8859-1.
I have a directory that contains files with file names in UTF-8.
I start dired on that directory.  I want to see the UTF-8 characters
so I do C-x RET r utf-8.  File names display OK now.

But when trying to operate on a file, say opening it, I get
"File no longer exists; type `g' to update Dired buffer"
It seems that dired does not keep the original file name around, but
tries to open with the display name representation of the file name.

When I type g, I lose the UTF-8 coding and files are now displayed
as ISO-8859-1 again.  Setting buffer coding to UTF-8 does not help.

Do I have to set file-name-coding-system to UTF-8?  This solves the
problem, but my file-name-coding-system is really ISO-8859-1, it is
just this one directory that is UTF-8.

Thanks,

	Jan D.
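The failure described in this message can be reproduced outside Emacs. What follows is a minimal Python sketch, not Emacs code: the file name "å.txt" is a made-up example, and the two codecs merely stand in for the coding system chosen with C-x RET r and for file-name-coding-system.

```python
# Bytes as stored on disk: the (hypothetical) name "å.txt" in UTF-8.
on_disk = "å.txt".encode("utf-8")

# What the listing shows once it has been re-read as UTF-8.
displayed = on_disk.decode("utf-8")

# What is handed to the file system when the displayed name is encoded
# with a Latin-1 file-name coding system.
looked_up = displayed.encode("latin-1")

print(on_disk)               # b'\xc3\xa5.txt'
print(looked_up)             # b'\xe5.txt'
print(on_disk == looked_up)  # False
```

The lookup uses different bytes from the ones on disk, which is consistent with the "File no longer exists" message.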

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Strange behaviour with dired and UTF8
  2003-04-24 11:43 Strange behaviour with dired and UTF8 Jan D.
@ 2003-04-25 13:20 ` Kai Großjohann
  2003-05-01  6:52 ` Kenichi Handa
  1 sibling, 0 replies; 28+ messages in thread
From: Kai Großjohann @ 2003-04-25 13:20 UTC (permalink / raw)

"Jan D." <jan.h.d@swipnet.se> writes:

> But when trying to operate on a file, say opening it, I get
> "File no longer exists; type `g' to update Dired buffer"
> It seems that dired does not keep the original file name around, but
> tries to open with the display name representation of the file name.

Yeah, it seems that's how dired operates: it inserts the output from
"ls -l" into the buffer and then does operations on that buffer to
find the file name and suchlike.

Hm.  And the "ls -l" output contains not only the file names, it also
contains the dates.

I was going to suggest having dired-find-file bind
file-name-coding-system to the value used for reading the "ls -l"
output, but that will break when the date and the file names use
different encodings.
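The mixed-encoding concern can be illustrated with a small Python sketch. The listing line below is invented for the example (a Latin-1 month abbreviation followed by a UTF-8 file name); it is not real "ls -l" output and the code does not reflect how dired parses listings.

```python
# One hypothetical "ls -l" line as raw bytes: the month abbreviation is
# Latin-1 ("mär"), the file name is UTF-8 ("å.txt").
line = (b"-rw-r--r-- 1 jan users 0 "
        + "mär  3 12:00 ".encode("latin-1")
        + "å.txt".encode("utf-8"))

# Decoding the whole line as Latin-1 keeps the date but garbles the name.
as_latin1 = line.decode("latin-1")

# Decoding the whole line as UTF-8 fails on the Latin-1 byte in the date.
try:
    line.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1)
print(utf8_ok)
```

No single codec decodes both parts of such a line correctly, so one coding system bound for the whole listing cannot serve both the date and the name.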

-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)


* Re: Strange behaviour with dired and UTF8
  2003-04-24 11:43 Strange behaviour with dired and UTF8 Jan D.
  2003-04-25 13:20 ` Kai Großjohann
@ 2003-05-01  6:52 ` Kenichi Handa
  2003-05-02  6:41   ` Kai Großjohann
  2003-05-02  8:16   ` Jan D.
  1 sibling, 2 replies; 28+ messages in thread
From: Kenichi Handa @ 2003-05-01  6:52 UTC (permalink / raw)
  Cc: emacs-devel

In article <200304241235.h3OCZdbL023178@stubby.bodenonline.com>, "Jan D." <jan.h.d@swipnet.se> writes:
> Maybe I am doing this wrong, but here is what I try to do.
> My language environment is ISO-8859-1.
> I have a directory that contains files with file names in UTF-8.
> I start dired on that directory.  I want to see the UTF-8 characters
> so I do C-x RET r utf-8.  File names display OK now.

> But when trying to operate on a file, say opening it, I get
> "File no longer exists; type `g' to update Dired buffer"
> It seems that dired does not keep the original file name around, but
> tries to open with the display name representation of the file name.

> When I type g, I lose the UTF-8 coding and files are now displayed
> as ISO-8859-1 again.  Setting buffer coding to UTF-8 does not help.

> Do I have to set file-name-coding-system to UTF-8?  This solves the
> problem, but my file-name-coding-system is really ISO-8859-1, it is
> just this one directory that is UTF-8.

The current Emacs doesn't have a facility to cope with such
a situation well.

How about this?

(1) Make a customizable variable
    file-name-coding-system-alist; the format is the same as
    file-coding-system-alist.

(2) Make the macros ENCODE_FILE and DECODE_FILE check that
    variable before using file-name-coding-system and
    default-file-name-coding-system.

(3) Enhance the function dired-revert to update
    file-name-coding-system-alist automatically if it is
    called with coding-system-for-read being bound to
    non-nil.  In that case, it may also have to ask a user
    to save that modification for the future session (via
    customize).
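Steps (1) and (2) can be sketched roughly as follows. This is Python rather than Emacs Lisp or C, and everything in it is an assumption made for illustration: the names FILE_NAME_CODING_ALIST, coding_for and encode_file, the "/mnt/utf8share/" path, and the first-match-wins lookup are not taken from any Emacs source.

```python
import re

# Hypothetical stand-in for the proposed file-name-coding-system-alist:
# (regular expression, codec) pairs; the first matching entry wins.
FILE_NAME_CODING_ALIST = [
    (re.compile(r"^/mnt/utf8share/"), "utf-8"),
]

# Hypothetical stand-in for the global file-name-coding-system.
DEFAULT_FILE_NAME_CODING = "latin-1"


def coding_for(path: str) -> str:
    """Return the codec to use for PATH, consulting the alist first."""
    for pattern, codec in FILE_NAME_CODING_ALIST:
        if pattern.search(path):
            return codec
    return DEFAULT_FILE_NAME_CODING


def encode_file(path: str) -> bytes:
    """Encode PATH for the file system using the per-path codec."""
    return path.encode(coding_for(path))


print(encode_file("/mnt/utf8share/å.txt"))  # b'/mnt/utf8share/\xc3\xa5.txt'
print(encode_file("/home/jan/å.txt"))       # b'/home/jan/\xe5.txt'
```

A decoding counterpart would consult the same table. Step (3), updating the table from dired-revert, is not covered by this sketch.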

What do people think?  Are there any better ideas?

---
Ken'ichi HANDA
handa@m17n.org


* Re: Strange behaviour with dired and UTF8
  2003-05-01  6:52 ` Kenichi Handa
@ 2003-05-02  6:41   ` Kai Großjohann
  2003-05-02  8:16   ` Jan D.
  1 sibling, 0 replies; 28+ messages in thread
From: Kai Großjohann @ 2003-05-02  6:41 UTC (permalink / raw)


Kenichi Handa <handa@m17n.org> writes:

> What do people think?  Aren't there any better idea?

Your idea sounds good to me.

Automatically saving the changes is potentially dangerous¹, but oh, well.

¹ Makes it easy for the user to shoot themselves in the foot: put
  setq statements for file-name-coding-system-alist after
  custom-set-variables, bam!
-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)


* Re: Strange behaviour with dired and UTF8
  2003-05-01  6:52 ` Kenichi Handa
  2003-05-02  6:41   ` Kai Großjohann
@ 2003-05-02  8:16   ` Jan D.
  2003-05-02  8:56     ` Kenichi Handa
  1 sibling, 1 reply; 28+ messages in thread
From: Jan D. @ 2003-05-02  8:16 UTC (permalink / raw)
  Cc: emacs-devel

> In article <200304241235.h3OCZdbL023178@stubby.bodenonline.com>, "Jan D." <jan.h.d@swipnet.se> writes:
>> Maybe I am doing this wrong, but here is what I try to do.
>> My language environment is ISO-8859-1.
>> I have a directory that contains files with file names in UTF-8.
>> I start dired on that directory.  I want to see the UTF-8 characters
>> so I do C-x RET r utf-8.  File names display OK now.
>
>> But when trying to operate on a file, say opening it, I get
>> "File no longer exists; type `g' to update Dired buffer"
>> It seems that dired does not keep the original file name around, but
>> tries to open with the display name representation of the file name.
>
>> When I type g, I lose the UTF-8 coding and files are now displayed
>> as ISO-8859-1 again.  Setting buffer coding to UTF-8 does not help.
>
>> Do I have to set file-name-coding-system to UTF-8?  This solves the
>> problem, but my file-name-coding-system is really ISO-8859-1, it is
>> just this one directory that is UTF-8.
>
> The current Emacs doesn't have a facility to cope with such
> a situation well.
>
> How about this?
>
> (1) Make a customizable variable
>     file-name-coding-system-alist; the format is the same as
>     file-coding-system-alist.
>
> (2) Make the macro ENCODE_FILE and DECODE_FILE to check that
>     variable before using file-name-coding-system and
>     default-file-name-coding-system.
>
> (3) Enhance the function dired-revert to update
>     file-name-coding-system-alist automatically if it is
>     called with coding-system-for-read being bound to
>     non-nil.  In that case, it may also have to ask a user
>     to save that modification for the future session (via
>     customize).
>
> What do people think?  Aren't there any better idea?

This sounds very complicated.  As I understand it, dired first gets
the file name from ls (original representation), then converts that to
whatever encoding it shall use to show it in the buffer (view
representation).  When dired operates on the file (opening for example),
it converts back from the view representation, hoping to get the
original representation.  But this may fail, since conversion
from view back to original is not one-to-one.

This work (original representation -> view representation -> original
representation) should not be needed, IMHO.  Why not just keep the
original representation around (some kind of text property on the file
name?) and always use that when operating on the file?  That change
would be transparent to users.

I do not know how dired works, but I think a separation of original
representation and view representation would make it easier for
dired to use any encoding to view the files.
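The separation of original and view representation suggested here might look like the following Python sketch. The class ListedFile and its methods are hypothetical names chosen for the example; nothing here describes how dired is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListedFile:
    """Hypothetical directory entry: original bytes plus a display view."""

    raw: bytes  # the name exactly as the file system returned it

    def display(self, codec: str) -> str:
        # The view can be switched freely; bytes that do not decode
        # under the chosen codec are shown as U+FFFD.
        return self.raw.decode(codec, errors="replace")

    def name_for_open(self) -> bytes:
        # File operations always use the untouched original bytes.
        return self.raw


entry = ListedFile(b"\xc3\xa5.txt")
print(entry.display("utf-8"))    # å.txt
print(entry.display("latin-1"))  # Ã¥.txt
print(entry.name_for_open())     # b'\xc3\xa5.txt'
```

Switching the view never touches the bytes used for file operations, so no conversion back from the view is needed.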

	Jan D.


* Re: Strange behaviour with dired and UTF8
  2003-05-02  8:16   ` Jan D.
@ 2003-05-02  8:56     ` Kenichi Handa
  2003-05-02  9:59       ` Jan D.
  2003-05-03 15:03       ` Richard Stallman
  0 siblings, 2 replies; 28+ messages in thread
From: Kenichi Handa @ 2003-05-02  8:56 UTC (permalink / raw)
  Cc: emacs-devel

In article <6DDE98F0-7C76-11D7-8080-00039363E640@swipnet.se>, "Jan D." <jan.h.d@swipnet.se> writes:
>>  How about this?
>> 
>>  (1) Make a customizable variable
>>      file-name-coding-system-alist; the format is the same as
>>      file-coding-system-alist.
>> 
>>  (2) Make the macro ENCODE_FILE and DECODE_FILE to check that
>>      variable before using file-name-coding-system and
>>      default-file-name-coding-system.
>> 
>>  (3) Enhance the function dired-revert to update
>>      file-name-coding-system-alist automatically if it is
>>      called with coding-system-for-read being bound to
>>      non-nil.  In that case, it may also have to ask a user
>>      to save that modification for the future session (via
>>      customize).
>> 
>>  What do people think?  Aren't there any better idea?

> This sounds very complicated.  As I understand it, dired first gets
> the file name from ls (original representation), then converts that to
> whatever encoding it shall use to show it in the buffer (view
> representation).  When dired operates on the file (opening for example),
> it converts back from the view representation, hoping to get the
> original representation.  But this may fail, since conversion
> from view back to original is not one-to-one.

It is true that encoding a filename may fail to reproduce
the original filename.  But Emacs can't handle such a
filename anyway.

> This work (original representation -> view representation -> original
> representation) should not be needed, IMHO.  Why just not keep the
> original representation around (some kind of text property on the file
> name?) and always use that when operating on the file?  That change 
> would be transparent to users.

A user may type C-x C-f FILENAME in the dired buffer.  With
the above method, we don't know how to encode FILENAME.

And, even if one types `f' to visit a file, in that file
buffer, we lose the information of the original
representation.

---
Ken'ichi HANDA
handa@m17n.org


* Re: Strange behaviour with dired and UTF8
  2003-05-02  8:56     ` Kenichi Handa
@ 2003-05-02  9:59       ` Jan D.
  2003-05-02 11:22         ` Kenichi Handa
  2003-05-03 15:03       ` Richard Stallman
  1 sibling, 1 reply; 28+ messages in thread
From: Jan D. @ 2003-05-02  9:59 UTC (permalink / raw)
  Cc: emacs-devel

>> This sounds very complicated.  As I understand it, dired first gets
>> the file name from ls (original representation), then converts that to
>> whatever encoding it shall use to show it in the buffer (view
>> representation).  When dired operates on the file (opening for example),
>> it converts back from the view representation, hoping to get the
>> original representation.  But this may fail, since conversion
>> from view back to original is not one-to-one.
>
> It is sure that there's a possibility that encoding a
> filename can't get the original filename.  But, Emacs anyway
> can't handle such a filename.

Why not if it has the original filename?

>> This work (original representation -> view representation -> original
>> representation) should not be needed, IMHO.  Why just not keep the
>> original representation around (some kind of text property on the file
>> name?) and always use that when operating on the file?  That change
>> would be transparent to users.
>
> A user may type C-x C-f FILENAME in the dired buffer.  With
> the above method, we don't know how to encode FILENAME.

Why would this change?  I am only talking about file names that dired
reads from a directory.  No need to change C-x C-f.

> And, even if one types `f' to visit a file, in that file
> buffer, we lose the information of the original
> representation.

Then Emacs as a whole should change.  If I open a file from dired,
modify it and save it, I expect it to save to the same file name.
Are you saying there are situations where Emacs fails to do this?
That sounds like a major bug to me.  Maybe the buffer itself also
needs to keep the original file name around.

	Jan D.


* Re: Strange behaviour with dired and UTF8
  2003-05-02  9:59       ` Jan D.
@ 2003-05-02 11:22         ` Kenichi Handa
  2003-05-02 12:44           ` Jan D.
  0 siblings, 1 reply; 28+ messages in thread
From: Kenichi Handa @ 2003-05-02 11:22 UTC (permalink / raw)
  Cc: emacs-devel

In article <C667D673-7C84-11D7-B30E-00039363E640@swipnet.se>, "Jan D." <jan.h.d@swipnet.se> writes:
>>  It is sure that there's a possibility that encoding a
>>  filename can't get the original filename.  But, Emacs anyway
>>  can't handle such a filename.

> Why not if it has the original filename?

I'm talking about the general situation, not restricted to
dired.  I think this problem must be fixed in general cases,
not only for dired.  Always carrying the original filename
around with a filename is one way to do that, but it
requires a huge change to Emacs.  In addition, there are many
places that modify a filename as a string.

>>>  This work (original representation -> view representation -> original
>>>  representation) should not be needed, IMHO.  Why just not keep the
>>>  original representation around (some kind of text property on the file
>>>  name?) and always use that when operating on the file?  That change
>>>  would be transparent to users.
>> 
>>  A user may type C-x C-f FILENAME in the dired buffer.  With
>>  the above method, we don't know how to encode FILENAME.

> Why would this change?  I am only talking about file names that dired
> reads from a directory.  No need to change C-x C-f.

Typing `f' works fine but C-x C-f doesn't, which is not a
good behaviour.

>>  And, even if one types `f' to visit a file, in that file
>>  buffer, we lose the information of the original
>>  representation.

> Then Emacs as a whole should change.

Yes, my proposal is to change Emacs' filename handling
as a whole at a fairly low cost.

> If I open a file from dired, modify it and save it, I
> expect it to save to the same file name.  Are you saying
> there are situations where Emacs fails to do this?

No.  As far as I know, there's no system that allows
stateful encodings for filenames.  And if Emacs decodes a
filename with a stateless coding system (whether or not it
is the correct one), it can be encoded back correctly
by the same coding system.  For instance, I think you can
open and save a file with a utf-8 name in a latin-1 language
environment in dired correctly (although the filename is not shown
correctly).
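The latin-1 case mentioned above can be checked with a short Python sketch. It uses Python's "latin-1" codec as a stand-in for the Emacs coding system, and the name "å.txt" is a made-up example.

```python
# Every possible byte value (not a valid file name, just full coverage).
all_bytes = bytes(range(256))

# Latin-1 maps each byte to exactly one character and back, so decoding
# and re-encoding with that same codec returns the input unchanged.
round_trip = all_bytes.decode("latin-1").encode("latin-1")
print(round_trip == all_bytes)  # True

# A UTF-8 name read as Latin-1 is displayed wrongly ...
utf8_name = "å.txt".encode("utf-8")
shown = utf8_name.decode("latin-1")
print(shown)  # Ã¥.txt

# ... but encoding the displayed text again yields the original bytes.
recovered = shown.encode("latin-1")
print(recovered == utf8_name)  # True
```

The round trip only holds when the same codec is used in both directions, which is the situation this message describes.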

By the way, I've just thought of this weird situation.  One
has a file with a utf-8 name in a directory with a latin-1 name.  :-(
I think we can say sorry in such a case.

---
Ken'ichi HANDA
handa@m17n.org


* Re: Strange behaviour with dired and UTF8
  2003-05-02 11:22         ` Kenichi Handa
@ 2003-05-02 12:44           ` Jan D.
  2003-05-03 15:03             ` Richard Stallman
                               ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Jan D. @ 2003-05-02 12:44 UTC (permalink / raw)
  Cc: emacs-devel

> >>  A user may type C-x C-f FILENAME in the dired buffer.  With
> >>  the above method, we don't know how to encode FILENAME.
> 
> > Why would this change?  I am only talking about file names that dired
> > reads from a directory.  No need to change C-x C-f.
> 
> Typing `f' works fine but C-x C-f doesn't, which is not a
> good behaviour.

I think I understand now.  You mean if dired uses UTF8, and file system
coding is Latin-1, C-x C-f would then use Latin-1, and possibly fail?

I agree that this is bad, but I am not sure anything can be done
about it.  Both KDE and GNOME file managers and file dialogs fail to open
the right file in certain cases.  I think it is worse if dired fails on
'f' since in that case the file name is supplied by dired, not the user.
For C-x C-f there is always TAB to see what Emacs thinks the file is called.

> 
> >>  And, even if one types `f' to visit a file, in that file
> >>  buffer, we lose the information of the original
> >>  representation.
> 
> > Then Emacs as a whole should change.
> 
> Yes, my proposal is to change Emacs' behavior as to filename
> handling as a whole in a fairly low cost.
> 

I am not sure your proposal covers all cases.  If a file name was
latin-1 and then converted to UTF8 (outside Emacs), Emacs would think it is
still latin-1, no?
It involves a bit of user interaction, making it intrusive.

> By the way, I've just thought of this weird situation.  One
> has a file of utf-8 name in a directory of latin-1 name.  :-(
> I think we can say sorry in such a case.

But then you would be using non-printable latin-1 characters.  I don't
think this is something one has to handle.

	Jan D.


* Re: Strange behaviour with dired and UTF8
  2003-05-02 12:44           ` Jan D.
@ 2003-05-03 15:03             ` Richard Stallman
  2003-05-03 18:04               ` Jan D.
  2003-05-03 15:59             ` Stephen J. Turnbull
  2003-05-05  9:20             ` Kenichi Handa
  2 siblings, 1 reply; 28+ messages in thread
From: Richard Stallman @ 2003-05-03 15:03 UTC (permalink / raw)
  Cc: handa

    I think I understand now.  You mean if dired uses UTF8, and file system
    coding is Latin-1,

Why would Dired use UTF8 if the file name encoding is Latin-1?
Is this because the user set up perverse settings?
Or is there some natural, normal set of options
for which this would occur?


* Re: Strange behaviour with dired and UTF8
  2003-05-02  8:56     ` Kenichi Handa
  2003-05-02  9:59       ` Jan D.
@ 2003-05-03 15:03       ` Richard Stallman
  2003-05-03 18:11         ` Jan D.
  2003-05-06  5:39         ` Kenichi Handa
  1 sibling, 2 replies; 28+ messages in thread
From: Richard Stallman @ 2003-05-03 15:03 UTC (permalink / raw)
  Cc: jan.h.d

It would be fundamentally clean to make sure that decoding of file
names is never many-one.  Is that possible?  Some of your messages
suggest it is already the case.


* Re: Strange behaviour with dired and UTF8
  2003-05-02 12:44           ` Jan D.
  2003-05-03 15:03             ` Richard Stallman
@ 2003-05-03 15:59             ` Stephen J. Turnbull
  2003-05-03 17:59               ` Jan D.
  2003-05-05  9:20             ` Kenichi Handa
  2 siblings, 1 reply; 28+ messages in thread
From: Stephen J. Turnbull @ 2003-05-03 15:59 UTC (permalink / raw)
  Cc: Kenichi Handa

>>>>> "Jan" == Jan D <jan.h.d@swipnet.se> writes:

    Jan> But then you would be using non-printable latin-1 characters.

That's impossible.  By definition, all Latin 1 is printable.

    Jan> I don't think this is something one has to handle.

Maybe not you.  Emacs has high standards, though. :-)

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.


* Re: Strange behaviour with dired and UTF8
  2003-05-03 15:59             ` Stephen J. Turnbull
@ 2003-05-03 17:59               ` Jan D.
  0 siblings, 0 replies; 28+ messages in thread
From: Jan D. @ 2003-05-03 17:59 UTC (permalink / raw)
  Cc: emacs-devel

On Saturday, 3 May 2003, at 17:59, Stephen J. Turnbull wrote:

>>>>>> "Jan" == Jan D <jan.h.d@swipnet.se> writes:
>
>     Jan> But then you would be using non-printable latin-1 characters.
>
> That's impossible.  By definition, all Latin 1 is printable

Then every implementation of isprint() is wrong :-).  Characters
128-159 are not printable, I think.  Nor is the non-printable part
that is in the ASCII subset.
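The claim about code points 128-159 can be checked against the Unicode character database, which assigns the same values as Latin-1 in that range. The Python sketch below uses unicodedata as a stand-in for C's isprint(), whose result depends on the locale; the helper has_c1 and the two sample characters are hypothetical choices made for the example.

```python
import unicodedata

# Code points 128-159 are the C1 control characters (category "Cc").
c1_controls = [c for c in range(128, 160)
               if unicodedata.category(chr(c)) == "Cc"]
print(len(c1_controls))  # 32


def has_c1(data: bytes) -> bool:
    """True if DATA contains a byte in the C1 range 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


# UTF-8 continuation bytes lie in 0x80-0xBF, so some UTF-8 names contain
# C1 bytes when read as Latin-1 and others do not.
with_c1 = "Ā".encode("utf-8")     # b'\xc4\x80'
without_c1 = "å".encode("utf-8")  # b'\xc3\xa5'
print(has_c1(with_c1), has_c1(without_c1))  # True False
```

So a UTF-8 name read as Latin-1 may or may not contain control characters, depending on which characters it uses.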

>     Jan> I don't think this is something one has to handle.
>
> Maybe not you.  Emacs has high standards, though. :-)

Emacs would have to be able to read minds to do this correctly.
It is possible to make a buffer contain characters that look fine
when viewed as UTF-8, but Emacs cannot know if the user actually
wanted this to be latin-1.  It is just an interpretation of how
octets shall be viewed.  That is why I would like to say to
dired "Show me these file names interpreted as UTF-8" and then
later "show me these file names interpreted as latin-1", and also
be able to operate on the files.

	Jan D.


* Re: Strange behaviour with dired and UTF8
  2003-05-03 15:03             ` Richard Stallman
@ 2003-05-03 18:04               ` Jan D.
  2003-05-05 14:32                 ` Richard Stallman
  0 siblings, 1 reply; 28+ messages in thread
From: Jan D. @ 2003-05-03 18:04 UTC (permalink / raw)
  Cc: emacs-devel

On Saturday, 3 May 2003, at 17:03, Richard Stallman wrote:

>     I think I understand now.  You mean if dired uses UTF8, and file system
>     coding is Latin-1,
>
> Why would Dired use UTF8 if the file name encoding is Latin-1?
> Is this because the user set up perverse settings?
> Or is there some natural, normal set of options
> for which this would occur?

The situation I have is that there are directories with file names in
different encodings.  Latin-1 is most frequent, which is why I say
file name encoding is latin-1.  But some directories contain other
encodings, UTF-8 among them.  Some of these are on network file
systems, so I have no control over them.  But I would like to
be able to view them in Emacs.

I guess UTF-8 will win out in the end, but there are a lot of old
systems around.

	Jan D.


* Re: Strange behaviour with dired and UTF8
  2003-05-03 15:03       ` Richard Stallman
@ 2003-05-03 18:11         ` Jan D.
  2003-05-06  5:39         ` Kenichi Handa
  1 sibling, 0 replies; 28+ messages in thread
From: Jan D. @ 2003-05-03 18:11 UTC (permalink / raw)
  Cc: Kenichi Handa

> It would be fundamentally clean to make sure that decoding of file
> names is never many-one.  Is that possible?  Some of your messages
> suggest it is already the case.

I don't think it is possible as long as Emacs only has one file
system encoding (file-name-coding-system).

The original problem is this:

file-name-coding-system is latin-1
Open dired on a directory with UTF-8 file names.
Do C-x RET r utf-8.
Trying to operate on a file with non-ASCII characters gives
   "File no longer exists; type `g' to update Dired buffer"

This is because when decoding the file name Emacs uses latin-1 and thus
doesn't get the original file name back.

As long as there can be file names with different encodings this problem
can occur.

	Jan D.


* Re: Strange behaviour with dired and UTF8
  2003-05-02 12:44           ` Jan D.
  2003-05-03 15:03             ` Richard Stallman
  2003-05-03 15:59             ` Stephen J. Turnbull
@ 2003-05-05  9:20             ` Kenichi Handa
  2003-05-06 18:05               ` Jan D.
  2 siblings, 1 reply; 28+ messages in thread
From: Kenichi Handa @ 2003-05-05  9:20 UTC (permalink / raw)
  Cc: emacs-devel

In article <200305021336.h42DaHbN022640@stubby.bodenonline.com>, "Jan D." <jan.h.d@swipnet.se> writes:
> I think I understand now.  You mean if dired uses UTF8, and file system
> coding is Latin-1, C-x C-f would then use Latin-1, and possibly fail?

Yes.

> I agree that this is bad, but I am not sure anything can be done
> about it.

How about my proposal?   Doesn't it solve this problem?

> Both KDE and GNOME file managers and file dialogs fail to open
> the right file in certain cases.  I think it is worse if dired fails on
> 'f' since in that case the file name is supplied by dired, not the user.
> For C-x C-f there is always TAB to see what Emacs thinks the file is called.

But, *Completion* buffer doesn't show correct file names
because there are names encoded by latin-1.  How can one
choose what he wants?  In addition, TAB says "[no match]" if
one has already typed some non-ASCII characters.

> I am not sure your case covers all cases.  If a file name was
> latin-1 and then converted to UTF8 (outside Emacs), Emacs would think it is
> still latin-1, no?
> It involves a bit of user interaction, making it intrusive.

Yes, but I think Emacs doesn't have to care about such a
case.

---
Ken'ichi HANDA
handa@m17n.org


* Re: Strange behaviour with dired and UTF8
  2003-05-03 18:04               ` Jan D.
@ 2003-05-05 14:32                 ` Richard Stallman
  2003-05-07 15:51                   ` Jan D.
  0 siblings, 1 reply; 28+ messages in thread
From: Richard Stallman @ 2003-05-05 14:32 UTC (permalink / raw)
  Cc: emacs-devel

    The situation I have is that there are directories with file names in
    different encodings.  Latin-1 is most frequent, which is why I say
    file name encoding is latin-1.  But some directories contain other
    encodings, UTF-8 among them.

Perhaps what we should do is record the proper coding system to use
for a given buffer's file name string.  That way, when you visit a
buffer from a directory whose names are UTF-8 encoded, the
buffer will say "use UTF-8 to encode my file name."

We could also conceivably record this info in the file-name string
itself; but I have a bad feeling that that will lead to some sort
of incoherence that I cannot see at present.
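The first idea, recording a coding system per buffer, could be sketched like this in Python. The Buffer class, its field names and the visit function are hypothetical and chosen only for illustration; they do not correspond to existing Emacs variables.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Hypothetical buffer record; the field names are illustrative."""

    file_name: str                     # decoded name, editable as text
    file_name_coding: str = "latin-1"  # codec used to encode the name

    def encoded_file_name(self) -> bytes:
        # Saving encodes the name with the codec recorded for this buffer.
        return self.file_name.encode(self.file_name_coding)


def visit(raw_name: bytes, directory_coding: str) -> Buffer:
    # Visiting from a directory listing records the listing's codec.
    return Buffer(raw_name.decode(directory_coding), directory_coding)


buf = visit(b"\xc3\xa5.txt", "utf-8")
print(buf.file_name)            # å.txt
print(buf.encoded_file_name())  # b'\xc3\xa5.txt'
```

The name stays ordinary text inside the buffer, and only the choice of codec travels with it.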


* Re: Strange behaviour with dired and UTF8
  2003-05-03 15:03       ` Richard Stallman
  2003-05-03 18:11         ` Jan D.
@ 2003-05-06  5:39         ` Kenichi Handa
  2003-05-06 14:41           ` Richard Stallman
  2003-05-07 15:49           ` Jan D.
  1 sibling, 2 replies; 28+ messages in thread
From: Kenichi Handa @ 2003-05-06  5:39 UTC (permalink / raw)
  Cc: jan.h.d

In article <E19ByYD-0000t0-00@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:
> It would be fundamentally clean to make sure that decoding of file
> names is never many-one.  Is that possible?

For that, we must prevent file-name-coding-system from being
set to a coding system that does many-to-one decoding
(e.g. iso-2022-jp).  But, we don't have a general mechanism
to inhibit a symbol to be bound to a specific value.
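Why iso-2022-jp counts as many-to-one can be seen with Python's "iso2022_jp" codec, used here as a stand-in for the Emacs coding system. The byte strings are made-up examples.

```python
# ISO-2022-JP is stateful: ESC ( B switches to ASCII, which is also the
# initial state, so the escape sequence below is redundant but legal.
plain = b"abc"
with_escape = b"\x1b(Babc"

decoded_plain = plain.decode("iso2022_jp")
decoded_escape = with_escape.decode("iso2022_jp")
print(decoded_plain == decoded_escape)  # True

# Two different byte strings decode to the same text, so re-encoding
# cannot tell which of them was the original.
reencoded = decoded_escape.encode("iso2022_jp")
print(reencoded == with_escape)  # False
```

With such a coding system, a decoded file name does not determine the bytes it came from.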

> Some of your messages suggest it is already the case.

As far as I know, there's no system that allows a coding
system that does many-to-one decoding for filenames.  So, we
don't have to care about such a case.

---
Ken'ichi HANDA
handa@m17n.org


* Re: Strange behaviour with dired and UTF8
  2003-05-06  5:39         ` Kenichi Handa
@ 2003-05-06 14:41           ` Richard Stallman
  2003-05-07 15:49           ` Jan D.
  1 sibling, 0 replies; 28+ messages in thread
From: Richard Stallman @ 2003-05-06 14:41 UTC (permalink / raw)
  Cc: jan.h.d

    For that, we must inhibit to set file-name-coding-system to
    such a coding system that will do many-to-one decoding
    (e.g. iso-2022-jp).  But, we don't have a general mechanism
    to inhibit a symbol to be bound to a specific value.

That's one way to do it.  Another would be to refuse to use such a value
if the symbol does have it.  Another way is to discourage users from
using such coding systems.

    As far as I know, there's no system that allows a coding
    system that does many-to-one decoding for filenames.  So, we
    don't have to care such a case.

It sounds like the third method has already been implemented.  That's
good.


* Re: Strange behaviour with dired and UTF8
  2003-05-05  9:20             ` Kenichi Handa
@ 2003-05-06 18:05               ` Jan D.
  2003-05-07  1:08                 ` Kenichi Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Jan D. @ 2003-05-06 18:05 UTC (permalink / raw)
  Cc: emacs-devel

>> I agree that this is bad, but I am not sure anything can be done
>> about it.
>
> How about my proposal?   Doesn't it solve this problem?

It depends on what the file-name-coding-system-alist looks like.  If it
contains full file name path, it could.  Maybe it is best to try it.

I think it is bad to have multiple information sources that have to
be consulted to figure out the original file name (the display file
name, the buffer encoding, file system encoding, and the new alist).
At some point Emacs must have had the original file name.  It is a
shame to throw away that knowledge and then try to reconstruct it.

>> Both KDE and GNOME file managers and file dialogs fail to open
>> the right file in certain cases.  I think it is worse if dired fails on
>> 'f' since in that case the file name is supplied by dired, not the user.
>> For C-x C-f there is always TAB to see what Emacs thinks the file is called.
>
> But, *Completion* buffer doesn't show correct file names
> because there are names encoded by latin-1.  How one can
> choose what he want?  In addition, TAB says "[no match]" if
> one has already typed some non-ASCII characters.

Another approach would be to always keep file names as is (i.e.
the original file name) and put some sort of property on it that is the
encoding.  This would require that the display engine can display these
with the right encoding.  That way the manipulation is always done on and
with the original file name.

This is of course some work.

>> I am not sure your case covers all cases.  If a file name was
>> latin-1 and then converted to UTF8 (outside Emacs), Emacs would think it is
>> still latin-1, no?
>> It involves a bit of user interaction, making it intrusive.
>
> Yes, but I think Emacs doesn't have to care about such a
> case.

Why not?  I think this is about as bad as the failure of the
*Completion*  buffer.  Maybe worse, because you can not open the file
at all.

	Jan D.


* Re: Strange behaviour with dired and UTF8
  2003-05-06 18:05               ` Jan D.
@ 2003-05-07  1:08                 ` Kenichi Handa
  2003-05-07 15:43                   ` Jan D.
  0 siblings, 1 reply; 28+ messages in thread
From: Kenichi Handa @ 2003-05-07  1:08 UTC (permalink / raw)
  Cc: emacs-devel

In article <6129D384-7FED-11D7-81D0-00039363E640@swipnet.se>, "Jan D." <jan.h.d@swipnet.se> writes:
>>>  I agree that this is bad, but I am not sure anything can be done
>>>  about it.
>> 
>>  How about my proposal?   Doesn't it solve this problem?

> It depends on what the file-name-coding-system-alist looks like.  If it
> contains full file name path, it could.  Maybe it is best to try it.

It should contain a regular expression matching a directory
or a file name.

> I think it is bad to have multiple information sources that has to
> be consulted to figure out the original file name (the display file
> name, the buffer encoding, file system encoding, and the new alist).
> At some point Emacs must have had the original file name.  It is a
> shame to throw away that knowledge and then try to reconstruct it.

Unless we have a mechanism to always keep that knowledge, it
is not reliable.  For instance, even if we keep the original
filename as a text property of a filename string, a filename
string may be modified in various ways and make the property
value obsolete.  And, I don't know if the names listed in
*Completion* buffer can keep that property.

So, I think keeping the information about the original
filename in an alist is the most reliable way.  In addition,
we can use that information in the future emacs session,
which is also an important point.

> Another approach would be to always keep file names as is (i.e.
> the original file name) and put some sort of property on it that is the
> encoding.  This would require that the display engine can display these
> with the right encoding.  That way the manipulations are always done on
> and with the original file name.

I strongly oppose that method.  Emacs should not work on
undecoded raw bytes.  A filename is a kind of text, and thus
a user should be able to handle it as a text (edit,
copy&paste, etc).

>>>  I am not sure your case covers all cases.  If a file name was
>>>  latin-1 and then converted to UTF8 (outside Emacs), Emacs would
>>>  think it is still latin-1, no?
>>>  It involves a bit of user interaction, making it intrusive.
>> 
>>  Yes, but I think Emacs doesn't have to care about such a
>>  case.

> Why not?  I think this is about as bad as the failure of the
> *Completion*  buffer.  Maybe worse, because you can not open the file
> at all.

If that filename is recorded as latin-1 in
file-name-coding-system-alist, we can open that file by
customizing file-name-coding-system-alist.  If that filename
is not recorded in the alist, we can open that file by
switching to utf-8 lang. env., or by setting
file-name-coding-system to utf-8, or by customizing
file-name-coding-system-alist.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Strange behaviour with dired and UTF8
  2003-05-07  1:08                 ` Kenichi Handa
@ 2003-05-07 15:43                   ` Jan D.
  0 siblings, 0 replies; 28+ messages in thread
From: Jan D. @ 2003-05-07 15:43 UTC (permalink / raw)
  Cc: emacs-devel

> In article <6129D384-7FED-11D7-81D0-00039363E640@swipnet.se>, "Jan D." 
> <jan.h.d@swipnet.se> writes:
>>>>  I agree that this is bad, but I am not sure anything can be done
>>>>  about it.
>>>
>>>  How about my proposal?   Doesn't it solve this problem?
>
>> It depends on what the file-name-coding-system-alist looks like.
>> If it contains full file name path, it could.  Maybe it is best to
>> try it.
>
> It should contain a regular expression matching a directory
> or a file name.

Can you give an example?

> So, I think keeping the information about the original
> filename in an alist is the most reliable way.  In addition,
> we can use that information in the future emacs session,
> which is also an important point.

Here the danger of two unrelated information sources getting out of
sync is apparent.


> I strongly oppose that method.  Emacs should not work on
> undecoded raw bytes.  A filename is a kind of text, and thus
> a user should be able to handle it as a text (edit,
> copy&paste, etc).

It is more than that, it is an identifier to an entity that is external
to Emacs.  Normal text is not that.  When using it as an identifier it
should work on undecoded raw bytes (it tries to do that today, by
converting back from the display representation to the original
representation).  There is nothing that prevents editing the text.

>>>>  I am not sure your case covers all cases.  If a file name was
>>>>  latin-1 and then converted to UTF8 (outside Emacs), Emacs would
>>>>  think it is still latin-1, no?
>>>>  It involves a bit of user interaction, making it intrusive.
>>>
>>>  Yes, but I think Emacs doesn't have to care about such a
>>>  case.
>
>> Why not?  I think this is about as bad as the failure of the
>> *Completion*  buffer.  Maybe worse, because you can not open the file
>> at all.
>
> If that filename is recorded as latin-1 in
> file-name-coding-system-alist, we can open that file by
> customizing file-name-coding-system-alist.  If that filename
> is not recorded in the alist, we can open that file by
> switching to utf-8 lang. env., or by setting
> file-name-coding-system to utf-8, or by customizing
> file-name-coding-system-alist.

Who is "we" that is doing all this?  The user, Emacs, someone else?
It seems like a lot of user interaction, but maybe you have another
mechanism in mind?

	Jan D.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Strange behaviour with dired and UTF8
  2003-05-06  5:39         ` Kenichi Handa
  2003-05-06 14:41           ` Richard Stallman
@ 2003-05-07 15:49           ` Jan D.
  2003-05-07 16:31             ` Stefan Monnier
  1 sibling, 1 reply; 28+ messages in thread
From: Jan D. @ 2003-05-07 15:49 UTC (permalink / raw)
  Cc: emacs-devel

> As far as I know, there's no system that allows a coding
> system that does many-to-one decoding for filenames.  So, we
> don't have to care about such a case.

I don't understand what you mean here.  We must be talking about
different things.

Say I have two files, one in UTF-8 and one in latin-1.  Emacs has only
one coding system for file names, say it is latin-1.

Now, since Emacs only has one coding system, it assumes there is
a one-to-one correspondence between file names and encodings.
Clearly this is not the case.  In this case there are two separate
mappings to map from display string to original file name.  That is
what I mean with many-to-one.
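
To make that concrete, here is a small Python sketch (not Emacs code;
the file name and the helper to_disk are invented for illustration).
Two different byte sequences on disk decode to the same display
string, and encoding that string back with a single global coding
recovers only one of them:

```python
# One display string, as the user would see it in a listing.
display = "r\u00e4ksm\u00f6rg\u00e5s.txt"

# The two different names that could exist on disk for it.
latin1_name = display.encode("iso-8859-1")
utf8_name = display.encode("utf-8")


def to_disk(display_name, coding="iso-8859-1"):
    """Encode a display name using one global file name coding."""
    return display_name.encode(coding)


# Both on-disk names decode to the same display string ...
same_display = (latin1_name.decode("iso-8859-1")
                == utf8_name.decode("utf-8")
                == display)

# ... but the single global coding only leads back to one of them.
finds_latin1 = to_disk(display) == latin1_name
finds_utf8 = to_disk(display) == utf8_name
```

Here same_display and finds_latin1 are true while finds_utf8 is false,
which is the "File no longer exists" situation.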

	Jan D.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Strange behaviour with dired and UTF8
  2003-05-05 14:32                 ` Richard Stallman
@ 2003-05-07 15:51                   ` Jan D.
  2003-05-07 16:09                     ` Stefan Monnier
  0 siblings, 1 reply; 28+ messages in thread
From: Jan D. @ 2003-05-07 15:51 UTC (permalink / raw)
  Cc: handa

>     The situation I have is that there are directories with file names in
>     different encodings.  Latin-1 is most frequent, which is why I say
>     file name encoding is latin-1.  But some directories contain other
>     encodings, UTF-8 among them.
>
> Perhaps what we should do is record the proper coding system to use
> for a given buffer's file name string.  That way, when you visit a
> buffer from a directory whose names are UTF-8 encoded, the
> buffer will say "use UTF-8 to encode my file name."

This is basically Handa's proposal.

> We could also conceivably record this info in the file-name string
> itself; but I have a bad feeling that that will lead to some sort
> of incoherence that I cannot see at present.

This is basically my proposal.

I think Handa's proposal is easier to implement.

	Jan D.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Strange behaviour with dired and UTF8
  2003-05-07 15:51                   ` Jan D.
@ 2003-05-07 16:09                     ` Stefan Monnier
  2003-05-09 11:19                       ` Richard Stallman
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Monnier @ 2003-05-07 16:09 UTC (permalink / raw)
  Cc: handa

I don't exactly understand Handa's proposal, so could someone
explain to me how it handles a situation such as /<foo>/<bar>
where <foo> is encoded in latin-1 and <bar> in utf-8 ?

	Stefan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Strange behaviour with dired and UTF8
  2003-05-07 15:49           ` Jan D.
@ 2003-05-07 16:31             ` Stefan Monnier
  2003-05-07 17:40               ` Jan D.
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Monnier @ 2003-05-07 16:31 UTC (permalink / raw)
  Cc: Kenichi Handa

> Say I have two files, one in UTF-8 and one in latin-1.  Emacs has only
> one coding system for file names, say it is latin-1.

Question: how do other applications deal with such situations ?

I mean, of course Emacs should do better than the rest of the crowd,
but if most/all other applications fail miserably, then it's unlikely
that people will use such setups and it would be wrong for Emacs to
make it easier to create such a setup (unless maybe only Emacs
will ever care about those file names, of course).

	Stefan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Strange behaviour with dired and UTF8
  2003-05-07 16:31             ` Stefan Monnier
@ 2003-05-07 17:40               ` Jan D.
  0 siblings, 0 replies; 28+ messages in thread
From: Jan D. @ 2003-05-07 17:40 UTC (permalink / raw)
  Cc: Kenichi Handa

On Wednesday, 7 May 2003, at 18:31, Stefan Monnier wrote:

>> Say I have two files, one in UTF-8 and one in latin-1.  Emacs has only
>> one coding system for file names, say it is latin-1.
>
> Question: how do other applications deal with such situations ?
>
> I mean, of course Emacs should do better than the rest of the crowd,
> but if most/all other applications fail miserably, then it's unlikely
> that people will use such setups and it would be wrong for Emacs to
> make it easier to create such a setup (unless maybe only Emacs
> will ever care about those file names, of course).

I can only say that GNOME (Nautilus) deals with this fine, better than
most.  It can actually display two files, one in latin-1 and the other
in UTF-8, that have the same display representation, so it looks like
the two files have the same name.  When clicking on them (to open
for example), it opens the correct file (I use the size of the files
to tell them apart).  When renaming a file, it uses UTF-8 always.
I think this is as good as it gets.

I don't know in detail, but given that UTF-8 is so fundamental to GNOME,
I think Nautilus first tries UTF-8, and if the name isn't valid UTF-8,
it tries the user's locale.  Actually Nautilus behaves better than most
other GNOME applications.  For example, gedit always tries UTF-8 for
displaying the file name and says "invalid UTF-8" if that fails.
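
The fallback I am guessing at would look roughly like this in Python
(a sketch of the idea only, not taken from the Nautilus sources; the
function name and the latin-1 default are assumptions):

```python
def decode_file_name(raw, locale_coding="iso-8859-1"):
    """Decode raw file name bytes: try UTF-8 first, then the locale.

    Returns the display string together with the coding that worked,
    so the original bytes can be produced again later.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the user's locale coding.
        return raw.decode(locale_coding), locale_coding
```

This is only a heuristic: some latin-1 byte sequences are also valid
UTF-8, so a latin-1 name can occasionally be taken for UTF-8.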

KDE (Konqueror) seems to use the locale character set always.

Other systems can change the view character set.  Much like you
can do in Netscape/Mozilla.  Open up a directory and then you can
toggle the coding system used to display file names (in Mozilla:
View -> Character coding).  This is what I thought Emacs could do, but
it lost the original file name.

	Jan D.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Strange behaviour with dired and UTF8
  2003-05-07 16:09                     ` Stefan Monnier
@ 2003-05-09 11:19                       ` Richard Stallman
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Stallman @ 2003-05-09 11:19 UTC (permalink / raw)
  Cc: emacs-devel

    I don't exactly understand Handa's proposal, so could someone
    explain to me how it handles a situation such as /<foo>/<bar>
    where <foo> is encoded in latin-1 and <bar> in utf-8 ?

If you literally mean that the absolute file name in the file system
consists of a Latin-1 part and a UTF-8 part, my first reaction would
have been "give up".  But it occurs to me that if Emacs decodes the
components one by one, it might be able to handle this case correctly
without too much work.

Re-encoding such names is more difficult.  I think the only possible method
is to record the proper coding system in text properties in the string.
We would have to make expand-file-name preserve these properties when
it makes sense; likewise other functions that operate on file names.
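
A rough Python sketch of the idea (not Emacs code): each component is
decoded separately and the coding that was used is kept next to it,
standing in for the text property.  Guessing UTF-8 first with a
latin-1 fallback is an assumption made for the example, and all the
function names are invented.

```python
def decode_component(raw, fallback="iso-8859-1"):
    """Decode one path component; return (text, coding used)."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


def decode_path(raw_path, fallback="iso-8859-1"):
    """Decode each '/'-separated component of a raw path on its own."""
    return [decode_component(part, fallback)
            for part in raw_path.split(b"/")]


def encode_path(components):
    """Re-encode each component with the coding recorded for it."""
    return b"/".join(text.encode(coding) for text, coding in components)
```

Re-encoding gives back the original bytes only as long as every
operation on the name carries the recorded codings along, which is
where the work lies.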

It adds up to a fair amount of work--not impossible, but perhaps
not worth the trouble.

    I mean, of course Emacs should do better than the rest of the crowd,
    but if most/all other applications fail miserably, then it's unlikely
    that people will use such setups and it would be wrong for Emacs to
    make it easier to create such a setup

I agree with that point.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2003-05-09 11:19 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
2003-04-24 11:43 Strange behaviour with dired and UTF8 Jan D.
2003-04-25 13:20 ` Kai Großjohann
2003-05-01  6:52 ` Kenichi Handa
2003-05-02  6:41   ` Kai Großjohann
2003-05-02  8:16   ` Jan D.
2003-05-02  8:56     ` Kenichi Handa
2003-05-02  9:59       ` Jan D.
2003-05-02 11:22         ` Kenichi Handa
2003-05-02 12:44           ` Jan D.
2003-05-03 15:03             ` Richard Stallman
2003-05-03 18:04               ` Jan D.
2003-05-05 14:32                 ` Richard Stallman
2003-05-07 15:51                   ` Jan D.
2003-05-07 16:09                     ` Stefan Monnier
2003-05-09 11:19                       ` Richard Stallman
2003-05-03 15:59             ` Stephen J. Turnbull
2003-05-03 17:59               ` Jan D.
2003-05-05  9:20             ` Kenichi Handa
2003-05-06 18:05               ` Jan D.
2003-05-07  1:08                 ` Kenichi Handa
2003-05-07 15:43                   ` Jan D.
2003-05-03 15:03       ` Richard Stallman
2003-05-03 18:11         ` Jan D.
2003-05-06  5:39         ` Kenichi Handa
2003-05-06 14:41           ` Richard Stallman
2003-05-07 15:49           ` Jan D.
2003-05-07 16:31             ` Stefan Monnier
2003-05-07 17:40               ` Jan D.
