From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: undecided vs utf-8
Date: Fri, 05 Nov 2010 11:01:58 +0900
To: Lars Magne Ingebrigtsen
Cc: emacs-devel@gnu.org
In-Reply-To: (message from Lars Magne Ingebrigtsen on Thu, 04 Nov 2010 23:27:57 +0100)

Lars Magne Ingebrigtsen writes:

> When using erc, it decodes iso-8859-1 fine with the default `undecided'
> into encoding. However, any utf-8 strings are, sort of, just translated
> into the same coding system:

> (decode-coding-string "u-te-\303\246ff \303\245tte" 'undecided)
>>> "u-te-Ã¦ff Ã¥tte"

Perhaps that is because you are in an iso-8859-1 locale. I'm in the
ja_JP.UTF-8 locale, so here the above is decoded as utf-8.

> (decode-coding-string "u-te-\303\246ff \303\245tte" 'utf-8)
>>> "u-te-æff åtte"

> So, uhm... Is this meant to be this way? I know that guessing the
> first thing is, well, correct, sort of -- it's valid iso-8859-1,
> although very strange. But it's also valid utf-8. Shouldn't
> `decode-coding-string' prefer utf-8 if it's actually valid? If it's
> valid utf-8, then it's quite likely that it's meant to be utf-8, even
> though other coding systems are also possible.

I don't want to add such a heuristic to decode-coding-string/region
(the lowest-level functions available from Lisp). Please note that
the above sequence is also valid as Big5.
If people are in a Big5 locale, it's hard to say which of utf-8 or
big5 should be preferred unless we implement an NLP system.

Perhaps it would be good to make an upper-layer function that accepts
a list of preferred coding systems; something like this:

(defun detect-and-decode-coding-string (str preferred)
  "Decode STR, choosing among the coding systems detected for it.
PREFERRED is a list of coding systems in priority order.  Use the
first one that was also detected, or else the first detected one."
  (let ((detected (detect-coding-string str))
        decided)
    ;; Walk PREFERRED until one of its entries is among DETECTED.
    (while (and preferred (not decided))
      (if (memq (car preferred) detected)
          (setq decided (car preferred))
        (setq preferred (cdr preferred))))
    (decode-coding-string str (or decided (car detected)))))

---
Kenichi Handa
handa@m17n.org
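
As a small illustration of the ambiguity discussed in this message, the sketch below decodes the quoted byte string under several coding systems and then asks Emacs which coding systems it considers possible. The names `my-sample-bytes` and `my-decode-with-each` are hypothetical, introduced only for this example; they are not part of Emacs or of the original mail. Note that `decode-coding-string` does not signal an error for a poor match, so the first part only shows the different readings, and the list returned by `detect-coding-string` is expected to vary with the locale and coding-system priorities.

```elisp
;; Hypothetical variable: the raw bytes quoted in the thread.  The octal
;; escapes make this a unibyte string.
(defvar my-sample-bytes "u-te-\303\246ff \303\245tte"
  "Sample byte string that is plausible in more than one coding system.")

;; Hypothetical helper: decode the same bytes once per coding system.
(defun my-decode-with-each (bytes coding-systems)
  "Decode BYTES once for each entry of CODING-SYSTEMS.
Return an alist of (CODING-SYSTEM . DECODED-STRING)."
  (mapcar (lambda (cs) (cons cs (decode-coding-string bytes cs)))
          coding-systems))

;; Three readings of the same bytes.
(my-decode-with-each my-sample-bytes '(iso-8859-1 utf-8 big5))

;; The candidates Emacs detects, in its current priority order.
;; The result depends on the language environment, so none is shown.
(detect-coding-string my-sample-bytes)

;; If the helper sketched in the message has been evaluated, an explicit
;; preference list settles the choice between utf-8 and big5.
(when (fboundp 'detect-and-decode-coding-string)
  (detect-and-decode-coding-string my-sample-bytes '(utf-8 big5)))
```

With the preference list, the caller rather than the low-level decoder decides which of several valid interpretations wins, which is the design choice argued for above.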