From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: undecided vs utf-8
Date: Fri, 05 Nov 2010 13:42:29 +0900
To: Lars Magne Ingebrigtsen
Cc: emacs-devel@gnu.org
In-Reply-To: (message from Lars Magne Ingebrigtsen on Fri, 05 Nov 2010 03:32:02 +0100)

Lars Magne Ingebrigtsen writes:

> Kenichi Handa writes:

> > It's perhaps because you are in some of iso-8859-1 locale.

> I don't think I am, but I might be wrong.  There are so many locale
> variables, but I always try to put my machines into "C" locale.

???  When the locale is "C", Emacs prefers utf-8 the most.

% LANG=C emacs -Q -batch --eval '(message "%s" (car (coding-system-priority-list)))'

should print utf-8.

> > I don't want to add such a heuristic in
> > decode-coding-string/region (the lowest functions available
> > from Lisp).  Please note that above sequence is also valid
> > as Big5.  If people are in Big5 locale, it's hard to answer
> > which of utf-8 or big5 is preferred unless we implement NLP
> > system.

> I don't know how the big5 encoding looks like, but when it comes to
> iso-8859-1 vs utf-8, then there are many utf-8 strings that are valid
> iso-8859-1 strings, but there are few iso-8859-1 strings that are valid
> utf-8 strings.  Therefore it seems to make sense to prefer utf-8 over
> iso-8859-1.

Perhaps.  But please consider why someone would be in an iso-8859-1
locale nowadays.  Isn't it because he prefers iso-8859-1 over utf-8?
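
The byte-level claim quoted above (UTF-8 text is always acceptable to an iso-8859-1 decoder, while typical iso-8859-1 text is rarely valid UTF-8) can be checked with a small sketch. It is written in Python only because that makes it self-contained; it says nothing about how Emacs' `undecided' detection is implemented, and the helper names are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if DATA decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_valid_latin1(data: bytes) -> bool:
    """Return True if DATA decodes as ISO-8859-1 (hypothetical helper).

    ISO-8859-1 assigns a character to every byte value, so Python's
    decoder accepts any input; the try/except is kept for symmetry.
    """
    try:
        data.decode("iso-8859-1")
    except UnicodeDecodeError:
        return False
    return True


text = "blåbærsyltetøy"
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")

# UTF-8 bytes always pass as ISO-8859-1 (they just display as mojibake).
print(is_valid_latin1(as_utf8))

# Ordinary ISO-8859-1 text with non-ASCII letters fails strict UTF-8.
print(is_valid_utf8(as_latin1))

# The rare ambiguous case: an ISO-8859-1 string whose bytes happen to
# form a well-formed UTF-8 sequence.
ambiguous = "Ã¦".encode("iso-8859-1")
print(is_valid_utf8(ambiguous))
```

The last case is the one the locale argument is about: both readings are well formed, so only a stated preference can decide between them.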
> > Perhaps making an upper layer function that will accept a
> > list of preferred coding systems will be good; something
> > like this.
> >
> > (defun detect-and-decode-coding-string (str preferred)
> >   (let ((detected (detect-coding-string str))
> >         decided)
> >     (while (and preferred (not decided))
> >       (if (memq (car preferred) detected)
> >           (setq decided (car preferred))
> >         (setq preferred (cdr preferred))))
> >     (decode-coding-string str (or decided (car detected)))))

> Well, this is about `undecided', and the C layer does DWIM-ish
> processing when you ask it to decode `undecided', doesn't it?

I don't know which Emacs behaviour you describe as DWIM-ish.

> The use case that made me look into this -- erc -- is somewhat special.
> The irc protocol does no charset tagging, and some clients send some
> charsets, and some send others, which is why erc uses `undecided' as the
> default coding system.  Typically on a channel you'll see somebody using
> a local (iso-8859-* is popular) charset, and others using utf-8.

> Perhaps the fix here isn't to do anything with `undecided' per se, but
> just fix erc.  It's trivial enough -- just have the default be, say,
> `undecided-or-utf-8', and then handle that by running
> `detect-coding-string' over it, see whether it's utf-8, and then either
> use that or pass `undecided' down into the decoding functions.

> I don't know.  What do you think?

I think the best way is to provide users with an easy way to specify
the correct coding-system when they see a decoding error, as well as a
method to customize the default coding-system for erc.

---
Kenichi Handa
handa@m17n.org
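
The policy behind the quoted `detect-and-decode-coding-string' (take the first preferred coding system that the detector also reports, otherwise the detector's first guess) can be sketched outside Emacs as follows. This is a rough Python analogue under stated assumptions, not the Emacs implementation: strict decoding with Python's codecs stands in for `detect-coding-string', and the names `detect_codecs', `detect_and_decode' and `CANDIDATES' are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: the encodings worth trying, in detector priority order.
CANDIDATES = ("utf-8", "iso-8859-1")


def detect_codecs(data: bytes, candidates: Sequence[str]) -> List[str]:
    """Return the candidates that decode DATA without error, in order."""
    detected = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        detected.append(name)
    return detected


def detect_and_decode(data: bytes,
                      preferred: Sequence[str],
                      candidates: Sequence[str] = CANDIDATES) -> Tuple[str, str]:
    """Decode DATA with the first preferred codec that was detected.

    Falls back to the first detected codec when no preferred one fits.
    Names in PREFERRED must be spelled exactly as in CANDIDATES.
    """
    detected = detect_codecs(data, candidates)
    if not detected:
        raise ValueError("no candidate encoding accepts this input")
    chosen = next((name for name in preferred if name in detected),
                  detected[0])
    return chosen, data.decode(chosen)


utf8_line = "æøå".encode("utf-8")
latin1_line = "æøå".encode("iso-8859-1")

# A user who prefers utf-8 gets utf-8 whenever the bytes allow it,
# and the other detected encoding when they do not.
print(detect_and_decode(utf8_line, ["utf-8"]))
print(detect_and_decode(latin1_line, ["utf-8"]))

# A user who prefers iso-8859-1 gets that reading even for UTF-8 bytes,
# because both encodings were detected as possible.
print(detect_and_decode(utf8_line, ["iso-8859-1"]))

# Adding another multibyte encoding to the candidates shows how a byte
# sequence can be acceptable to several decoders at once.
print(detect_codecs(utf8_line, ("utf-8", "big5", "iso-8859-1")))
```

With `preferred' set to utf-8 only, this behaves like the `undecided-or-utf-8' default suggested for erc; making the list a user option corresponds to letting users customize erc's default coding system.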