unofficial mirror of emacs-devel@gnu.org 
* Feature request: multibyte user-full-name
@ 2006-03-12  7:35 AIDA Shinra
  2006-03-14  1:48 ` Kenichi Handa
  0 siblings, 1 reply; 12+ messages in thread
From: AIDA Shinra @ 2006-03-12  7:35 UTC (permalink / raw)


Hello,

user-full-name might contain non-ASCII characters. For example,
pw_gecos is encoded in UTF-8 on Darwin.

There is no technical problem except deciding which coding system Emacs
should use to decode the name. We have three options:

1. Introduce something like directory-system-coding-system and guess
it in set-locale-environment.

2. Apply file-name-coding-system and pray that it works.

3. Hardcode for each platform.

* Re: Feature request: multibyte user-full-name
  2006-03-12  7:35 Feature request: multibyte user-full-name AIDA Shinra
@ 2006-03-14  1:48 ` Kenichi Handa
  2006-03-14  3:18   ` AIDA Shinra
  0 siblings, 1 reply; 12+ messages in thread
From: Kenichi Handa @ 2006-03-14  1:48 UTC (permalink / raw)
  Cc: emacs-devel

In article <m2acbwm7xd.wl%shinra@j10n.org>, AIDA Shinra <shinra@j10n.org> writes:

> Hello,
> user-full-name might contain non-ASCII characters. For example,
> pw_gecos is encoded in UTF-8 on Darwin.

> No technical problems exist except in which coding system should Emacs
> decode the username. We have three options:

> 1. Introduce something like directory-system-coding-system and guess
> it in set-locale-environment.

> 2. Apply file-name-coding-system and pray that it works.

> 3. Hardcode for each platform.

Why do you think that pw_gecos is related to something like
directory or file name?

Anyway, as long as a system allows users to switch locale, I
think pw_gecos must adopt a locale-independent encoding, so
the possible encoding is one of UTF-*.  And, considering
backward compatibility, it should be UTF-8.  Then, how about
we always decode it as utf-8 (only if it contains a byte
with the MSB set), falling back to locale-coding-system
if an invalid utf-8 sequence is found, and see if that works on
all systems?   How does GNU/Linux encode it?
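
The decoding strategy proposed above can be sketched roughly as follows.
This is only an illustration in Python, not Emacs code: the helper name
decode_gecos is hypothetical, and the fallback argument (or Python's
preferred locale encoding when it is omitted) merely stands in for
locale-coding-system.

```python
import locale
from typing import Optional


def decode_gecos(raw: bytes, fallback: Optional[str] = None) -> str:
    """Decode a pw_gecos-style byte string (hypothetical helper)."""
    # No byte with the MSB set: the string is plain ASCII, nothing to guess.
    if all(b < 0x80 for b in raw):
        return raw.decode("ascii")
    try:
        # First choice: treat the bytes as UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid UTF-8 sequence: fall back to a locale-style encoding.
        # Assumption: this stands in for Emacs's locale-coding-system.
        encoding = fallback or locale.getpreferredencoding(False)
        return raw.decode(encoding, errors="replace")
```

For instance, decode_gecos(b"J\xf6rgen", "iso-8859-1") takes the fallback
path, because the byte 0xF6 cannot start a valid UTF-8 sequence.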

By the way, does the mis-decoding of user-full-name lead to
any serious error?

---
Kenichi Handa
handa@m17n.org

* Re: Feature request: multibyte user-full-name
  2006-03-14  1:48 ` Kenichi Handa
@ 2006-03-14  3:18   ` AIDA Shinra
  2006-03-14  4:54     ` Zhang Wei
                       ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: AIDA Shinra @ 2006-03-14  3:18 UTC (permalink / raw)
  Cc: emacs-devel

> > Hello,
> > user-full-name might contain non-ASCII characters. For example,
> > pw_gecos is encoded in UTF-8 on Darwin.
> 
> > No technical problems exist except in which coding system should Emacs
> > decode the username. We have three options:
> 
> > 1. Introduce something like directory-system-coding-system and guess
> > it in set-locale-environment.
> 
> > 2. Apply file-name-coding-system and pray that it works.
> 
> > 3. Hardcode for each platform.
> 
> Why do you think that pw_gecos is related to something like
> directory or file name?

About 1: "directory system" is my miswording. I meant "directory
service".

About 2: *Pray* that an operating system and/or administrator adopt
the same encoding.

> Anyway, as far as a system allows users to switch locale, I
> think, pw_gecos must adopt locale-independent encoding, thus
> the possible encoding is one of UTF-*.  And, considering
> backward compatibility, it should be UTF-8.  Then, how about
> we always decode it by utf-8 (only if it contains a byte
> with MSB set) while falling back to locale-coding-system
> (invalid utf-8 sequence is found), and see if that works on
> any systems?   How does GNU/Linux encode it?

A site administrator might choose an encoding other than UTF-8 even if
it is locale-dependent...

> By the way, does the mis-decoding of user-full-name lead to
> any serious error?

I can't tell what you mean by "serious", but user-full-name is widely
used anyway.

* Re: Feature request: multibyte user-full-name
  2006-03-14  3:18   ` AIDA Shinra
@ 2006-03-14  4:54     ` Zhang Wei
  2006-03-14  6:22       ` Miles Bader
  2006-03-14  5:44     ` Kenichi Handa
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Zhang Wei @ 2006-03-14  4:54 UTC (permalink / raw)



AIDA Shinra <shinra@j10n.org> writes:

>> Anyway, as far as a system allows users to switch locale, I
>> think, pw_gecos must adopt locale-independent encoding, thus
>> the possible encoding is one of UTF-*.  And, considering
>> backward compatibility, it should be UTF-8.  Then, how about
>> we always decode it by utf-8 (only if it contains a byte
>> with MSB set) while falling back to locale-coding-system
>> (invalid utf-8 sequence is found), and see if that works on
>> any systems?   How does GNU/Linux encode it?
>
> A site administrator might choose an encoding other than UTF-8 even if
> it is locale-dependent...

Encoding is a *big* problem in the world of GNU/Linux, mainly because
of the lack of a standard.

For example, suppose two users select different locales on the same
GNU/Linux system: userA selects zh_CN.GB2312 and userB selects zh_CN.UTF-8.
Because all of the filename-related system calls read and write filenames
*as is*, userA's filenames are encoded in GB2312 and userB's in UTF-8.
As a result, they can't read each other's filenames; one user's
filenames are a complete mess to the other.
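
The mismatch described above can be reproduced with a small Python
sketch. The sample filename and the helper name view are made up for
illustration; the point is only that the same name yields different
bytes under GB2312 and UTF-8, and that reading one user's bytes with the
other's encoding does not give the name back.

```python
# A made-up sample filename containing non-ASCII characters.
name = "中文.txt"

# The bytes each user would hand to the filename-related system calls.
as_user_a = name.encode("gb2312")  # userA, locale zh_CN.GB2312
as_user_b = name.encode("utf-8")   # userB, locale zh_CN.UTF-8


def view(raw: bytes, encoding: str) -> str:
    """Show how a filename looks to a user whose locale uses ENCODING."""
    # errors="replace" mimics a tool that displays garbage instead of failing.
    return raw.decode(encoding, errors="replace")


print(view(as_user_a, "gb2312"))  # userA reading userA's own file
print(view(as_user_a, "utf-8"))   # userB reading userA's file
print(view(as_user_b, "gb2312"))  # userA reading userB's file
```

Only the first call returns the original name; the other two produce
replacement characters or unrelated glyphs.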

The Internet makes this problem even worse, since no standard
encoding is used for information exchange: ssh, ftp, vnc, ... almost
all of them run into encoding problems every day.

Perhaps the ultimate solution to this problem is to convert filenames
to utf-8 at the system call level, no matter which locale a user
chooses. If that is hard to do in the kernel, we should at least
do it at the libc level.

-- 
GnuPG KeyID: 0x61C92BB9

* Re: Feature request: multibyte user-full-name
  2006-03-14  3:18   ` AIDA Shinra
  2006-03-14  4:54     ` Zhang Wei
@ 2006-03-14  5:44     ` Kenichi Handa
  2006-03-14 16:17       ` AIDA Shinra
  2006-03-14 16:07     ` Stefan Monnier
  2006-03-14 22:36     ` Stefan Monnier
  3 siblings, 1 reply; 12+ messages in thread
From: Kenichi Handa @ 2006-03-14  5:44 UTC (permalink / raw)
  Cc: emacs-devel

In article <m27j6xww7d.wl%shinra@j10n.org>, AIDA Shinra <shinra@j10n.org> writes:

>> Why do you think that pw_gecos is related to something like
>> directory or file name?

> About 1: "directory system" is my miswording. I meant "directory
> service".

Ah, I see, but as for "directory service", we already have
ldap-coding-system (lisp/net/ldap.el).  It seems that adding
"directory-service-coding-system" would be confusing.  The most
specific variable name would be
"gecos-coding-system"... not that attractive.

> About 2: *Pray* that an operating system and/or administrator adopt
> the same encoding.

>> Anyway, as far as a system allows users to switch locale, I
>> think, pw_gecos must adopt locale-independent encoding, thus
>> the possible encoding is one of UTF-*.  And, considering
>> backward compatibility, it should be UTF-8.  Then, how about
>> we always decode it by utf-8 (only if it contains a byte
>> with MSB set) while falling back to locale-coding-system
>> (invalid utf-8 sequence is found), and see if that works on
>> any systems?   How does GNU/Linux encode it?

> A site administrator might choose an encoding other than UTF-8 even if
> it is locale-dependent...

Ummm.

>> By the way, does the mis-decoding of user-full-name lead to
>> any serious error?

> I can't determine your "serious" means but user-full-name is widely
> used anyway.

I mean an unrecoverable error.  When Emacs incorrectly
decodes user-full-name, if the user can recover by trying
the same command again after setting a proper coding system
in some variable, it may not be that serious an error.

---
Kenichi Handa
handa@m17n.org

* Re: Feature request: multibyte user-full-name
  2006-03-14  4:54     ` Zhang Wei
@ 2006-03-14  6:22       ` Miles Bader
  0 siblings, 0 replies; 12+ messages in thread
From: Miles Bader @ 2006-03-14  6:22 UTC (permalink / raw)


Zhang Wei <id.brep@gmail.com> writes:
> Perhaps the ultimate best solution to this problem is, at the system
> call level, filename is converted to utf-8 no matter what ever the
> locale a user choose. If it is hard to do in the kernel, at least we
> should do it at the libc level.

Neither is going to happen.  It's up to users to come to some agreement
on what they use.

-Miles
-- 
Occam's razor split hairs so well, I bought the whole argument!

* Re: Feature request: multibyte user-full-name
  2006-03-14  3:18   ` AIDA Shinra
  2006-03-14  4:54     ` Zhang Wei
  2006-03-14  5:44     ` Kenichi Handa
@ 2006-03-14 16:07     ` Stefan Monnier
  2006-03-20  4:47       ` Kenichi Handa
  2006-03-14 22:36     ` Stefan Monnier
  3 siblings, 1 reply; 12+ messages in thread
From: Stefan Monnier @ 2006-03-14 16:07 UTC (permalink / raw)
  Cc: emacs-devel, Kenichi Handa

>> Anyway, as far as a system allows users to switch locale, I
>> think, pw_gecos must adopt locale-independent encoding, thus
>> the possible encoding is one of UTF-*.  And, considering
>> backward compatibility, it should be UTF-8.  Then, how about
>> we always decode it by utf-8 (only if it contains a byte
>> with MSB set) while falling back to locale-coding-system
>> (invalid utf-8 sequence is found), and see if that works on
>> any systems?

That sounds like a reasonable plan.


        Stefan

* Re: Feature request: multibyte user-full-name
  2006-03-14  5:44     ` Kenichi Handa
@ 2006-03-14 16:17       ` AIDA Shinra
  0 siblings, 0 replies; 12+ messages in thread
From: AIDA Shinra @ 2006-03-14 16:17 UTC (permalink / raw)
  Cc: emacs-devel

> >> By the way, does the mis-decoding of user-full-name lead to
> >> any serious error?
> 
> > I can't determine your "serious" means but user-full-name is widely
> > used anyway.
> 
> I mean an unrecoverable error.  When Emacs incorrectly
> decodes user-full-name, if it can be recoverable by trying
> the same command again after setting a proper coding system
> to some variable, it may not be that serious error.

No unrecoverable errors as far as I know.

* Re: Feature request: multibyte user-full-name
  2006-03-14  3:18   ` AIDA Shinra
                       ` (2 preceding siblings ...)
  2006-03-14 16:07     ` Stefan Monnier
@ 2006-03-14 22:36     ` Stefan Monnier
  2006-03-15  7:46       ` Jan D.
  3 siblings, 1 reply; 12+ messages in thread
From: Stefan Monnier @ 2006-03-14 22:36 UTC (permalink / raw)
  Cc: emacs-devel, Kenichi Handa

> A site administrator might choose an encoding other than UTF-8 even if
> it is locale-dependent...

Nice theory.
Given the growing popularity of utf-8, I think we shouldn't even try
locale-coding-system (which risks false-positives).  Better try just utf-8
and if that fails, fall back on unibyte.
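
A rough Python sketch of this simpler variant: try utf-8 and otherwise
keep the raw bytes untouched. The helper name decode_utf8_or_raw is
hypothetical, and returning a bytes object is only an analogy for an
Emacs unibyte string.

```python
from typing import Union


def decode_utf8_or_raw(raw: bytes) -> Union[str, bytes]:
    """Decode RAW as UTF-8, or return it unchanged (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # No second guess: leave the bytes alone, as a stand-in for unibyte.
        return raw
```

One reason a second guess is risky is that many legacy single-byte
encodings accept almost any byte sequence, so decoding with the wrong
one tends to "succeed" with the wrong text instead of reporting an error.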

I think it's pretty likely that people who start *now* to add non-ASCII
chars in Gecos fields do it with utf-8.  As for those who've been doing it
for a while already with something else than utf-8: well, Emacs-21 read that
as unibyte and it was good enough that (almost) nobody complained, so
why change?


        Stefan

* Re: Feature request: multibyte user-full-name
  2006-03-14 22:36     ` Stefan Monnier
@ 2006-03-15  7:46       ` Jan D.
  2006-03-19  6:12         ` AIDA Shinra
  0 siblings, 1 reply; 12+ messages in thread
From: Jan D. @ 2006-03-15  7:46 UTC (permalink / raw)
  Cc: emacs-devel



Stefan Monnier wrote:

> I think it's pretty likely that people who start *now* to add non-ASCII
> chars in Gecos fields do it with utf-8.  As for those who've been doing it
> for a while already with something else than utf-8: well, Emacs-21 read that
> as unibyte and it was good enough that (almost) nobody complained, so
> why change?

I've had non-ASCII (ISO-8859-1 actually) characters in my gecos field for at 
least 10 years, but then again, I'm used to programs not working with anything 
but ASCII, so I haven't complained :-)

	Jan D.

* Re: Feature request: multibyte user-full-name
  2006-03-15  7:46       ` Jan D.
@ 2006-03-19  6:12         ` AIDA Shinra
  0 siblings, 0 replies; 12+ messages in thread
From: AIDA Shinra @ 2006-03-19  6:12 UTC (permalink / raw)
  Cc: emacs-devel

> 
> > I think it's pretty likely that people who start *now* to add non-ASCII
> > chars in Gecos fields do it with utf-8.  As for those who've been doing it
> > for a while already with something else than utf-8: well, Emacs-21 read that
> > as unibyte and it was good enough that (almost) nobody complained, so
> > why change?
> 
> I've had non-ASCII (ISO-8859-1 actually) characters in my gecos field for at 
> least 10 years, but then again, I'm used to programs not working with anything 
> but ASCII, so I haven't complained :-)

Now I support Handa's proposal. It will work at most real sites.

* Re: Feature request: multibyte user-full-name
  2006-03-14 16:07     ` Stefan Monnier
@ 2006-03-20  4:47       ` Kenichi Handa
  0 siblings, 0 replies; 12+ messages in thread
From: Kenichi Handa @ 2006-03-20  4:47 UTC (permalink / raw)
  Cc: emacs-devel, shinra


Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> Anyway, as far as a system allows users to switch locale, I
>>> think, pw_gecos must adopt locale-independent encoding, thus
>>> the possible encoding is one of UTF-*.  And, considering
>>> backward compatibility, it should be UTF-8.  Then, how about
>>> we always decode it by utf-8 (only if it contains a byte
>>> with MSB set) while falling back to locale-coding-system
>>> (invalid utf-8 sequence is found), and see if that works on
>>> any systems?

> That sounds like a reasonable plan.

AIDA Shinra <shinra@j10n.org> writes:

> Now I support Handa's proposal. It will work in most real sites.

Sorry, but I've just noticed that utf-8 decoding doesn't
work in init_editfns () (which sets Vuser_full_name)
because, at that point, emacs/lisp/international is not yet
in Vload_path and thus loading of subst-ksc, etc. fails if
the name contains CJK chars.  :-(

It may be possible to delay setting user-full-name until
Vload_path is set up, but I'm not sure it's worth making that
kind of nontrivial change at this stage.

On the other hand, the same strategy can be implemented in
emacs-unicode-2 without such a problem.  So, I'm going to do
that in that branch.

---
Kenichi Handa
handa@m17n.org
