undecided vs utf-8

unofficial mirror of emacs-devel@gnu.org 

* undecided vs utf-8
@ 2010-11-04 22:27 Lars Magne Ingebrigtsen
  2010-11-04 22:40 ` Lars Magne Ingebrigtsen
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-11-04 22:27 UTC (permalink / raw)
  To: emacs-devel

When using erc, the default `undecided' coding system decodes
iso-8859-1 fine.  However, any utf-8 strings are, sort of, just decoded
as if they were iso-8859-1 as well:

(decode-coding-string "u-te-\303\246ff \303\245tte" 'undecided)
=> "u-te-Ã¦ff Ã¥tte"

(decode-coding-string "u-te-\303\246ff \303\245tte" 'utf-8)
=> "u-te-æff åtte"

So, uhm...  Is this meant to be this way?  I know that guessing the
first thing is, well, correct, sort of -- it's valid iso-8859-1,
although very strange.  But it's also valid utf-8.  Shouldn't
`decode-coding-string' prefer utf-8 if it's actually valid?  If it's
valid utf-8, then it's quite likely that it's meant to be utf-8, even
though other coding systems are also possible.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen


* Re: undecided vs utf-8
  2010-11-04 22:27 undecided vs utf-8 Lars Magne Ingebrigtsen
@ 2010-11-04 22:40 ` Lars Magne Ingebrigtsen
  2010-11-05  0:02   ` Stefan Monnier
  2010-11-05  2:01 ` Kenichi Handa
  2010-11-05  7:56 ` Eli Zaretskii
  2 siblings, 1 reply; 14+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-11-04 22:40 UTC (permalink / raw)
  To: emacs-devel

After rooting through this rather confusing system, I guess what I'm
saying is that doing

(set-coding-system-priority 'utf-8)

might, perhaps, make sense?  As a default?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: undecided vs utf-8
  2010-11-04 22:40 ` Lars Magne Ingebrigtsen
@ 2010-11-05  0:02   ` Stefan Monnier
  2010-11-05  8:01     ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Monnier @ 2010-11-05  0:02 UTC (permalink / raw)
  To: emacs-devel

> After rooting through this rather confusing system, I guess what I'm
> saying is that doing
> (set-coding-system-priority 'utf-8)
> might, perhaps, make sense?  As a default?

I'd tend to agree.


        Stefan




* Re: undecided vs utf-8
  2010-11-04 22:27 undecided vs utf-8 Lars Magne Ingebrigtsen
  2010-11-04 22:40 ` Lars Magne Ingebrigtsen
@ 2010-11-05  2:01 ` Kenichi Handa
  2010-11-05  2:32   ` Lars Magne Ingebrigtsen
  2010-11-05  7:56 ` Eli Zaretskii
  2 siblings, 1 reply; 14+ messages in thread
From: Kenichi Handa @ 2010-11-05  2:01 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

In article <m3d3qkpvv6.fsf@quimbies.gnus.org>, Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> When using erc, the default `undecided' coding system decodes
> iso-8859-1 fine.  However, any utf-8 strings are, sort of, just decoded
> as if they were iso-8859-1 as well:

> (decode-coding-string "u-te-\303\246ff \303\245tte" 'undecided)
> => "u-te-Ã¦ff Ã¥tte"

It's perhaps because you are in an iso-8859-1 locale.
As I'm in the ja_JP.UTF-8 locale, the above is decoded as utf-8.

> (decode-coding-string "u-te-\303\246ff \303\245tte" 'utf-8)
> => "u-te-æff åtte"

> So, uhm...  Is this meant to be this way?  I know that guessing the
> first thing is, well, correct, sort of -- it's valid iso-8859-1,
> although very strange.  But it's also valid utf-8.  Shouldn't
> `decode-coding-string' prefer utf-8 if it's actually valid?  If it's
> valid utf-8, then it's quite likely that it's meant to be utf-8, even
> though other coding systems are also possible.

I don't want to add such a heuristic in
decode-coding-string/region (the lowest functions available
from Lisp).  Please note that the above sequence is also valid
as Big5.  If people are in a Big5 locale, it's hard to answer
which of utf-8 or big5 is preferred unless we implement an NLP
system.

Perhaps making an upper layer function that will accept a
list of preferred coding systems will be good; something
like this.

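(defun detect-and-decode-coding-string (str preferred)
  "Decode STR with the first coding system in PREFERRED that fits it.
Fall back to the highest-priority detected coding system."
  ;; Compare base coding systems, in case detection returns eol
  ;; variants such as `utf-8-unix'.
  (let ((detected (mapcar #'coding-system-base (detect-coding-string str)))
        decided)
    ;; Walk PREFERRED in order until one is among the detected candidates.
    (while (and preferred (not decided))
      (if (memq (car preferred) detected)
          (setq decided (car preferred))
        (setq preferred (cdr preferred))))
    (decode-coding-string str (or decided (car detected)))))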

---
Kenichi Handa
handa@m17n.org




* Re: undecided vs utf-8
  2010-11-05  2:01 ` Kenichi Handa
@ 2010-11-05  2:32   ` Lars Magne Ingebrigtsen
  2010-11-05  4:42     ` Kenichi Handa
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-11-05  2:32 UTC (permalink / raw)
  To: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> It's perhaps because you are in an iso-8859-1 locale.

I don't think I am, but I might be wrong.  There are so many locale
variables, but I always try to put my machines into "C" locale.

> I don't want to add such a heuristic in
> decode-coding-string/region (the lowest functions available
> from Lisp).  Please note that the above sequence is also valid
> as Big5.  If people are in a Big5 locale, it's hard to answer
> which of utf-8 or big5 is preferred unless we implement an NLP
> system.

I don't know what the big5 encoding looks like, but when it comes to
iso-8859-1 vs utf-8, then there are many utf-8 strings that are valid
iso-8859-1 strings, but there are few iso-8859-1 strings that are valid
utf-8 strings.  Therefore it seems to make sense to prefer utf-8 over
iso-8859-1.  Perhaps.
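
To make the asymmetry concrete, here is a minimal sketch.  The helper
name `my-utf-8-candidate-p' and the sample word are made up for
illustration, and it assumes the utf-8 detection category is bound to
`utf-8', as in a default setup.  It asks `detect-coding-string' whether
utf-8 is among the candidates for the same word encoded both ways:

```elisp
;; Illustrative only: the helper name and the sample word are made up.
(defun my-utf-8-candidate-p (bytes)
  "Return t if `detect-coding-string' lists utf-8 as a candidate for BYTES."
  (and (memq 'utf-8
             ;; Strip eol variants such as `utf-8-unix' before comparing.
             (mapcar #'coding-system-base (detect-coding-string bytes)))
       t))

(let* ((text "bl\u00e5b\u00e6r")
       (latin-1-bytes (encode-coding-string text 'iso-8859-1))
       (utf-8-bytes (encode-coding-string text 'utf-8)))
  ;; First element: is utf-8 a candidate for the iso-8859-1 bytes?
  ;; Second element: is utf-8 a candidate for the utf-8 bytes?
  (list (my-utf-8-candidate-p latin-1-bytes)
        (my-utf-8-candidate-p utf-8-bytes)))
```

The iso-8859-1 bytes contain a lone \345 followed by an ASCII letter,
which is not a well-formed utf-8 sequence, so utf-8 should be missing
from the first candidate list and present in the second.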

> Perhaps making an upper layer function that will accept a
> list of preferred coding systems will be good; something
> like this.
>
> (defun detect-and-decode-coding-string (str preferred)
>   (let ((detected (detect-coding-string str))
> 	decided)
>     (while (and preferred (not decided)) 
>       (if (memq (car preferred) detected)
> 	  (setq decided (car preferred))
> 	(setq preferred (cdr preferred))))
>     (decode-coding-string str (or decided (car detected)))))

Well, this is about `undecided', and the C layer does DWIM-ish
processing when you ask it to decode `undecided', doesn't it?

The use case that made me look into this -- erc -- is somewhat special.
The irc protocol does no charset tagging, and some clients send some
charsets, and some send others, which is why erc uses `undecided' as the
default coding system.  Typically on a channel you'll see somebody using
a local (iso-8859-* is popular) charset, and others using utf-8.

Perhaps the fix here isn't to do anything with `undecided' per se, but
just fix erc.  It's trivial enough -- just have the default be, say,
`undecided-or-utf-8', and then handle that by running
`detect-coding-string' over it, see whether it's utf-8, and then either
use that or pass `undecided' down into the decoding functions.
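
A rough sketch of that idea, not actual erc code: the function name
`my-decode-undecided-or-utf-8' is hypothetical, and treating a string
as utf-8 whenever `detect-coding-string' lists utf-8 among its
candidates is an assumption about how the check would be done.

```elisp
;; Hypothetical helper; erc has no function of this name.
(defun my-decode-undecided-or-utf-8 (bytes)
  "Decode BYTES as utf-8 if it is a utf-8 candidate, else as `undecided'."
  (let ((candidates
         ;; Compare base names so that `utf-8-unix' etc. still match.
         (mapcar #'coding-system-base (detect-coding-string bytes))))
    (decode-coding-string bytes
                          (if (memq 'utf-8 candidates)
                              'utf-8
                            ;; Not valid utf-8: leave the guess to the
                            ;; usual priority-driven detection.
                            'undecided))))

(my-decode-undecided-or-utf-8 "u-te-\303\246ff \303\245tte")
```

Pure ASCII input is reported as `undecided' by the detector, so it
simply takes the fallback branch.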

I don't know.  What do you think?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen


* Re: undecided vs utf-8
  2010-11-05  2:32   ` Lars Magne Ingebrigtsen
@ 2010-11-05  4:42     ` Kenichi Handa
  2010-11-05 13:02       ` Lars Magne Ingebrigtsen
  2010-11-05  8:09     ` Eli Zaretskii
  2010-11-05  8:10     ` Eli Zaretskii
  2 siblings, 1 reply; 14+ messages in thread
From: Kenichi Handa @ 2010-11-05  4:42 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

In article <m362wco5zx.fsf@quimbies.gnus.org>, Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Kenichi Handa <handa@m17n.org> writes:
> > It's perhaps because you are in an iso-8859-1 locale.

> I don't think I am, but I might be wrong.  There are so many locale
> variables, but I always try to put my machines into "C" locale.

??? When the locale is "C", emacs prefers utf-8 the most.

% LANG=C emacs -Q -batch --eval '(message "%s" (car (coding-system-priority-list)))'

should print utf-8.

> > I don't want to add such a heuristic in
> > decode-coding-string/region (the lowest functions available
> > from Lisp).  Please note that the above sequence is also valid
> > as Big5.  If people are in a Big5 locale, it's hard to answer
> > which of utf-8 or big5 is preferred unless we implement an NLP
> > system.

> I don't know what the big5 encoding looks like, but when it comes to
> iso-8859-1 vs utf-8, then there are many utf-8 strings that are valid
> iso-8859-1 strings, but there are few iso-8859-1 strings that are valid
> utf-8 strings.  Therefore it seems to make sense to prefer utf-8 over
> iso-8859-1.  Perhaps.

Please consider the reason why one is in an iso-8859-1 locale
nowadays.  Isn't it because one prefers iso-8859-1 over
utf-8?

> > Perhaps making an upper layer function that will accept a
> > list of preferred coding systems will be good; something
> > like this.
> >
> > (defun detect-and-decode-coding-string (str preferred)
> >   (let ((detected (detect-coding-string str))
> > 	decided)
> >     (while (and preferred (not decided)) 
> >       (if (memq (car preferred) detected)
> > 	  (setq decided (car preferred))
> > 	(setq preferred (cdr preferred))))
> >     (decode-coding-string str (or decided (car detected)))))

> Well, this is about `undecided', and the C layer does DWIM-ish
> processing when you ask it to decode `undecided', doesn't it?

I don't know which Emacs' behaviour you describe as DWIM-ish.

> The use case that made me look into this -- erc -- is somewhat special.
> The irc protocol does no charset tagging, and some clients send some
> charsets, and some send others, which is why erc uses `undecided' as the
> default coding system.  Typically on a channel you'll see somebody using
> a local (iso-8859-* is popular) charset, and others using utf-8.

> Perhaps the fix here isn't to do anything with `undecided' per se, but
> just fix erc.  It's trivial enough -- just have the default be, say,
> `undecided-or-utf-8', and then handle that by running
> `detect-coding-string' over it, see whether it's utf-8, and then either
> use that or pass `undecided' down into the decoding functions.

> I don't know.  What do you think?

I think the best way is to provide users an easy way to
specify a correct coding-system when they see a decoding
error as well as the method to customize the default
coding-system for erc.

---
Kenichi Handa
handa@m17n.org




* Re: undecided vs utf-8
  2010-11-04 22:27 undecided vs utf-8 Lars Magne Ingebrigtsen
  2010-11-04 22:40 ` Lars Magne Ingebrigtsen
  2010-11-05  2:01 ` Kenichi Handa
@ 2010-11-05  7:56 ` Eli Zaretskii
  2 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2010-11-05  7:56 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Thu, 04 Nov 2010 23:27:57 +0100
> 
> When using erc, the default `undecided' coding system decodes
> iso-8859-1 fine.  However, any utf-8 strings are, sort of, just decoded
> as if they were iso-8859-1 as well:
> 
> (decode-coding-string "u-te-\303\246ff \303\245tte" 'undecided)
> => "u-te-Ã¦ff Ã¥tte"
> 
> (decode-coding-string "u-te-\303\246ff \303\245tte" 'utf-8)
> => "u-te-æff åtte"

Please show the output of "M-x mule-diag RET" on the machine where
this happens.

> Shouldn't `decode-coding-string' prefer utf-8 if it's actually
> valid?

Depending on the user's locale and preferences, this could easily
backfire, especially if the text is insufficiently long to distinguish
between the two.

Using incorrect decoder in a small fraction of cases is a fact of
life; every program out there hits this from time to time.  What we
need is good defaults, and ways to customize those in specific
situations.  In this case, perhaps erc should use its own defaults, if
UTF-8 is widely (or solely) used there.


* Re: undecided vs utf-8
  2010-11-05  0:02   ` Stefan Monnier
@ 2010-11-05  8:01     ` Eli Zaretskii
  0 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2010-11-05  8:01 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Thu, 04 Nov 2010 20:02:15 -0400
> 
> > After rooting through this rather confusing system, I guess what I'm
> > saying is that doing
> > (set-coding-system-priority 'utf-8)
> > might, perhaps, make sense?  As a default?
> 
> I'd tend to agree.

Please don't.  The current defaults took a very long time to get to,
and they generally do a good job.  So let's not destabilize Emacs in
this area by changing such global defaults so unconditionally, based
on a single use-case of a single program.  At least not before we
understand well why Emacs didn't DTRT in that use-case.


* Re: undecided vs utf-8
  2010-11-05  2:32   ` Lars Magne Ingebrigtsen
  2010-11-05  4:42     ` Kenichi Handa
@ 2010-11-05  8:09     ` Eli Zaretskii
  2010-11-05 13:06       ` Lars Magne Ingebrigtsen
  2010-11-05  8:10     ` Eli Zaretskii
  2 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2010-11-05  8:09 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 05 Nov 2010 03:32:02 +0100
> 
> Kenichi Handa <handa@m17n.org> writes:
> 
> > It's perhaps because you are in an iso-8859-1 locale.
> 
> I don't think I am, but I might be wrong.  There are so many locale
> variables, but I always try to put my machines into "C" locale.

"M-x mule-diag RET" will show.

> I don't know what the big5 encoding looks like, but when it comes to
> iso-8859-1 vs utf-8, then there are many utf-8 strings that are valid
> iso-8859-1 strings, but there are few iso-8859-1 strings that are valid
> utf-8 strings.  Therefore it seems to make sense to prefer utf-8 over
> iso-8859-1.  Perhaps.

It will replace one imperfect heuristic with another.  Each one of
them fails sometimes, and when you hit that one use-case, it doesn't
comfort you whether you are in the 0.5% of losers or in 0.1%.

This is one use-case.  Let's investigate it thoroughly before we
ponder the possibility of changing global defaults for everyone.

> Well, this is about `undecided', and the C layer does DWIM-ish
> processing when you ask it to decode `undecided', doesn't it?

No.  There's no DWIM-like behavior in how Emacs guesses under
`undecided'.  It goes by the priority list and uses the first encoding
that can decode all the characters in the input text.  This process is
completely driven by the priority list, it does not consider anything
else.  The DWIM parts are those which set the priority list given your
preferences and the locale.

> The use case that made me look into this -- erc -- is somewhat special.
> The irc protocol does no charset tagging, and some clients send some
> charsets, and some send others, which is why erc uses `undecided' as the
> default coding system.  Typically on a channel you'll see somebody using
> a local (iso-8859-* is popular) charset, and others using utf-8.

This means that even if we do want to change the priority list, it
should only be done for erc.  The global defaults do not need to be
touched.
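
One way to scope such a change can be sketched as follows, assuming the
caller controls the place where decoding happens.  `with-coding-priority'
changes the priority list only around its body and restores it
afterwards, so the global defaults stay as they are.  The function name
`my-decode-preferring-utf-8' is hypothetical and not part of erc.

```elisp
;; Hypothetical helper; shown only to illustrate a locally scoped priority.
(defun my-decode-preferring-utf-8 (bytes)
  "Decode BYTES as `undecided', with utf-8 temporarily first in priority."
  (with-coding-priority '(utf-8)
    ;; Detection inside this body consults the temporary priority list.
    (decode-coding-string bytes 'undecided)))

;; The global list, as reported here, is the same before and after the call.
(coding-system-priority-list)
(my-decode-preferring-utf-8 "u-te-\303\246ff \303\245tte")
(coding-system-priority-list)
```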


* Re: undecided vs utf-8
  2010-11-05  2:32   ` Lars Magne Ingebrigtsen
  2010-11-05  4:42     ` Kenichi Handa
  2010-11-05  8:09     ` Eli Zaretskii
@ 2010-11-05  8:10     ` Eli Zaretskii
  2010-11-05 12:28       ` Deniz Dogan
  2010-11-05 12:50       ` Lars Magne Ingebrigtsen
  2 siblings, 2 replies; 14+ messages in thread
From: Eli Zaretskii @ 2010-11-05  8:10 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 05 Nov 2010 03:32:02 +0100
> 
> The use case that made me look into this -- erc -- is somewhat special.
> The irc protocol does no charset tagging, and some clients send some
> charsets, and some send others, which is why erc uses `undecided' as the
> default coding system.  Typically on a channel you'll see somebody using
> a local (iso-8859-* is popular) charset, and others using utf-8.

How do other irc clients solve this problem?




* Re: undecided vs utf-8
  2010-11-05  8:10     ` Eli Zaretskii
@ 2010-11-05 12:28       ` Deniz Dogan
  2010-11-05 12:50       ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 14+ messages in thread
From: Deniz Dogan @ 2010-11-05 12:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Lars Magne Ingebrigtsen, emacs-devel

2010/11/5 Eli Zaretskii <eliz@gnu.org>:
>> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
>> Date: Fri, 05 Nov 2010 03:32:02 +0100
>>
>> The use case that made me look into this -- erc -- is somewhat special.
>> The irc protocol does no charset tagging, and some clients send some
>> charsets, and some send others, which is why erc uses `undecided' as the
>> default coding system.  Typically on a channel you'll see somebody using
>> a local (iso-8859-* is popular) charset, and others using utf-8.
>
> How do other irc clients solve this problem?
>
>

rcirc uses utf-8 for both encoding and decoding by default, letting
the user override these settings for specific channels using
rcirc-coding-system-alist.

-- 
Deniz Dogan




* Re: undecided vs utf-8
  2010-11-05  8:10     ` Eli Zaretskii
  2010-11-05 12:28       ` Deniz Dogan
@ 2010-11-05 12:50       ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 14+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-11-05 12:50 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> How do other irc clients solve this problem?

I haven't used other clients than zenirc and erc, but I queried the
channel I'm on.

It seems like older clients just let you set one charset, and that's
used for decoding all the strings, so if somebody uses the wrong charset
(and you're on an ornery channel) everybody gets upset.  ("Charset
wars.")

Newer clients apparently just check for utf-8-ness first, and if it
looks like utf-8, it's decoded as utf-8.  If not, the per-channel
charset setting is used.
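
The behaviour of those newer clients might look roughly like this in
Emacs Lisp.  This is only a sketch: the variable
`my-irc-channel-coding-alist', the function `my-irc-decode', the
channel name and the iso-8859-1 default are all invented for
illustration, and "looks like utf-8" is approximated by checking
whether `detect-coding-string' lists utf-8 as a candidate.

```elisp
;; Everything named my-irc-* here is hypothetical.
(defvar my-irc-channel-coding-alist '(("#example" . iso-8859-1))
  "Alist mapping channel names to their fallback coding systems.")

(defun my-irc-decode (channel bytes)
  "Decode BYTES from CHANNEL, trying utf-8 before the channel's charset."
  (let ((candidates
         ;; Base names only, so eol variants do not hide a utf-8 match.
         (mapcar #'coding-system-base (detect-coding-string bytes)))
        (fallback (or (cdr (assoc channel my-irc-channel-coding-alist))
                      'iso-8859-1)))
    (decode-coding-string bytes
                          (if (memq 'utf-8 candidates) 'utf-8 fallback))))

;; Valid utf-8 is decoded as utf-8, whatever the channel setting says.
(my-irc-decode "#example" "u-te-\303\246ff \303\245tte")
;; Bytes that are not valid utf-8 get the per-channel charset.
(my-irc-decode "#example" "bl\345b\346r")
```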

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen


* Re: undecided vs utf-8
  2010-11-05  4:42     ` Kenichi Handa
@ 2010-11-05 13:02       ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 14+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-11-05 13:02 UTC (permalink / raw)
  To: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> ??? When the locale is "C", emacs prefers utf-8 the most.
>
> % LANG=C emacs -Q -batch --eval '(message "%s" (car
> (coding-system-priority-list)))'
>
> should print utf-8.

Let's see...

[larsi@quimbies ~/pgnus]$ echo $LANG
en_US

Right.  Well, I have no idea where that came from...

>> Well, this is about `undecided', and the C layer does DWIM-ish
>> processing when you ask it to decode `undecided', doesn't it?
>
> I don't know which Emacs' behaviour you describe as DWIM-ish.

Uhm...  if you ask a computer to do something vague, and the computer
does this based on settings the user is barely cognisant of, that's the
essence of DWIM.  In this case, it's looking at a string to try to guess
what the charset is ("do something vague"), and pick an answer from the
many possible based on settings the user is barely cognisant of.

I don't really see why you think this isn't DWIM...

> I think the best way is to provide users an easy way to
> specify a correct coding-system when they see a decoding
> error as well as the method to customize the default
> coding-system for erc.

So erc should pop up a message saying "even though this looks like
utf-8, I can tell from your locale spec that you're not expecting this,
so here's a list of 45 different charsets to choose from that the string
might be in, and I'll ask you this question for every IRC string until
you set a precedence list of charsets which I'll then force you to
choose between because utf-8 and iso-8859-1 aren't distinguishable
anyway"?  It somehow doesn't seem maximally user-friendly.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen


* Re: undecided vs utf-8
  2010-11-05  8:09     ` Eli Zaretskii
@ 2010-11-05 13:06       ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 14+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-11-05 13:06 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> This means that even if we do want to change the priority list, it
> should only be done for erc.  The global defaults do not need to be
> touched.

Yeah, I think so, too.  The irc use case is kinda unusual -- lots of
small strings that you either have to guess at what the charset is, or
establish social conventions per channel.  I can't, off the top of my
head, remember any other common situations where that happens...

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen
