From mboxrd@z Thu Jan 1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Jesper Harder
Newsgroups: gmane.emacs.bugs
Subject: Re: detect-coding-string doesn't return all possibilities
Date: Fri, 14 Mar 2003 05:34:32 +0100
Sender: bug-gnu-emacs-bounces+gnu-bug-gnu-emacs=m.gmane.org@gnu.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Cc: bug-gnu-emacs@gnu.org
Original-To: Kenichi Handa
In-Reply-To: (Kenichi Handa's message of "Thu, 13 Mar 2003 20:34:27 +0900 (JST)")
User-Agent: Gnus/5.090016 (Oort Gnus v0.16) Emacs/21.3.50 (gnu/linux)
List-Id: Bug reports for GNU Emacs, the Swiss army knife of text editors
Xref: main.gmane.org gmane.emacs.bugs:4608

--=-=-=
Kenichi Handa writes:

> detect-coding-string doesn't return all possible coding systems, but
> returns a possible coding systems Emacs may automatically detect in
> the current language environment.

Ah, I see.

Do you know of any other way to decide if using a given coding system
for decoding a string would give a valid result?  A function similar to
this would be really useful:

(defun possible-coding-system-for-string-p (str coding-system)
  "Return t if CODING-SYSTEM is a possible coding system for decoding STR."
  ...)

The issue comes from a discussion on the Gnus development list (I've
included one of the messages from that thread below).

Gnus does not work very well when using CVS Emacs in a UTF-8 locale,
because a lot of non-MIME-capable clients don't include proper charset
information.  This causes Gnus to decode many Latin-1 strings as UTF-8.

It would help a lot if we could detect that a string cannot possibly be
encoded in UTF-8.  I know that it's not always possible to distinguish,
but just detecting strings that are invalid as UTF-8 would be very
helpful.

This doesn't just apply to UTF-8 but to any coding system, of course.

> But the docstring of detect-coding-system is surely not
> good.  I've just changed the first paragraph as this.  How
> is it?

It's good.
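
A rough sketch of such a predicate follows. It assumes Emacs 23 or later, where `decode-coding-string' keeps bytes it cannot decode as raw bytes belonging to the `eight-bit' charset; the Emacs 21 internals discussed in this thread behave differently. The name `my-possible-coding-system-for-string-p' is hypothetical, and a result of t only means the bytes are well-formed for the coding system, not that the decoded text is what the sender meant.

```elisp
;; Sketch only: assumes Emacs 23+, where undecodable bytes survive
;; decoding as raw bytes in the `eight-bit' charset.
(defun my-possible-coding-system-for-string-p (str coding-system)
  "Return t if decoding unibyte STR with CODING-SYSTEM leaves no raw bytes.
This is a hypothetical helper, not an existing Emacs function."
  (let ((decoded (decode-coding-string str coding-system)))
    ;; `find-charset-string' lists the charsets present in DECODED;
    ;; `eight-bit' shows up when some byte could not be decoded.
    (not (memq 'eight-bit (find-charset-string decoded)))))
```

Under those assumptions, the Latin-1 bytes "\346\370\345" should be rejected for `utf-8' and accepted for `iso-8859-1'. Almost any byte string is acceptable as Latin-1, so the check is mainly useful for ruling UTF-8 out.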
--=-=-=
Content-Type: message/rfc822
Content-Disposition: inline

Path: quimby.gnus.org!not-for-mail
From: Jesper Harder
Newsgroups: gnus.ding
Subject: Re: charset=macintosh
Date: Sun, 09 Mar 2003 04:56:44 +0100
Organization: http://purl.org/harder/
Lines: 48
Approved: auto
References: <843clxud7u.fsf@lucy.is.informatik.uni-duisburg.de>
User-Agent: Gnus/5.090016 (Oort Gnus v0.16) Emacs/21.3.50 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=euc-kr
Content-Transfer-Encoding: quoted-printable

Simon Josefsson writes:

> But what if you are saying about UTF-8 clients being MIME capable is
> true, and since UTF-8 is typically never preferred by current emacsen,
> doesn't emacs' current guessing works the best we can hope for?
> Doesn't it detect among ISO-8859-X, ISO-2022 and Big5 properly?

No.  I was hoping we could do something like this (for headers):

(let ((coding-systems (detect-coding-string string)))
  (if (memq default coding-systems)
      (decode-coding-string string default)
    (decode-coding-string string (car coding-systems))))

i.e. if the default coding system is valid for the string, then use
that; otherwise use whatever Emacs thinks is the most likely coding
system.

I think this would be ideal.  But unfortunately `detect-coding-string'
_doesn't_ return a complete list of possible coding systems.  Consider
this scenario:

I'm using Emacs in a Latin-1 locale.  dk.* newsgroups work fine because
latin-1 is the default.  But I also subscribe to, say, a few Korean
newsgroups.  The entry in `gnus-group-charset-alist':

    ("\\(^\\|:\\)han\\>" euc-kr)

should take care of selecting the proper default charset.  But *oops*,
`detect-coding-string' doesn't think that euc-kr is a possible charset
for a Korean string encoded in euc-kr:

(detect-coding-string (encode-coding-string "안녕" 'euc-kr))
=> (iso-latin-1 iso-latin-1 raw-text japanese-shift-jis
    chinese-big5 no-conversion)

So the above approach would fail.
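
One way the "default first, detection second" policy could avoid depending on the detected list is to test the decoded result instead. The sketch below is only an illustration under stated assumptions: it needs Emacs 23 or later (raw bytes are reported as the `eight-bit' charset), and `my-decode-header' is a hypothetical name, not a Gnus or Emacs function.

```elisp
;; Sketch only: assumes Emacs 23+ raw-byte handling.
(defun my-decode-header (string default)
  "Decode unibyte STRING with DEFAULT unless that leaves raw bytes.
Otherwise fall back to the coding system Emacs considers most likely.
Hypothetical helper for illustration."
  (let ((decoded (decode-coding-string string default)))
    (if (memq 'eight-bit (find-charset-string decoded))
        ;; DEFAULT could not decode every byte; use Emacs's best guess.
        (decode-coding-string string (car (detect-coding-string string)))
      decoded)))
```

With `euc-kr' as the group default, a well-formed euc-kr string would be decoded as euc-kr without consulting `detect-coding-string' at all; the fallback only runs when the default leaves undecodable bytes.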

> 2) Users with emacs in UTF-8 prefers UTF-8 too often, even when the
>    data is invalid UTF-8 and another encoding should be selected.
>
> The second situation is a bug, and I hope we can fix this.

Yep, 2) is the most serious problem, especially since more and more
people are (often unknowingly) using a UTF-8 locale now that Red Hat 8
has switched to UTF-8 by default.  Those people would experience Gnus
as broken when reading hierarchies like dk.* or de.*.

--=-=-=
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

_______________________________________________
Bug-gnu-emacs mailing list
Bug-gnu-emacs@gnu.org
http://mail.gnu.org/mailman/listinfo/bug-gnu-emacs

--=-=-=--