From: Peter Dyballa
Newsgroups: gmane.emacs.help
Subject: Re: how to find encoding violations in Emacs buffer?
Date: Wed, 13 Dec 2006 11:45:21 +0100
References: <1165947493.201071.294760@l12g2000cwl.googlegroups.com> <1165999197.876027.315040@73g2000cwn.googlegroups.com>
In-Reply-To: <1165999197.876027.315040@73g2000cwn.googlegroups.com>
Original-To: riccardo.murri@gmail.com
Cc: help-gnu-emacs@gnu.org

On 13.12.2006 at 09:39, riccardo.murri@gmail.com wrote:

> Yes, but it may be hard to spot one single problematic character in a
> large buffer. In the case at hand, I had one Latin-1 "ù" in a 20k
> UTF-8 text,

This character exists in Unicode and can be encoded in UTF-8:

    [ù] 00F9 LATIN SMALL LETTER U WITH GRAVE

It cannot be the cause. In UTF-8 it's encoded as C3 B9. Kermit and
Unicode Emacs 23 have a file UnicodeData.txt that describes a lot of
Unicode characters. A bit more complete is Kermit's utf8.txt, from
which the above excerpt comes.

>
> Isn't there a way to implement a "goto-next-problematic-char" elisp
> function? UTF-8 has a rather simple algorithm to detect encoding
> violations, which can point at the precise point where a byte sequence
> violates UTF-8 rules, but I wondered if Emacs had a more general
> interface: if it knows where in the buffer the encoding violations
> are located, one would assume that this information would be available
> at elisp level.
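
For UTF-8 specifically, the check described in the quoted text can be
sketched outside Emacs. What follows is only a minimal Python sketch,
not an Emacs facility: the function name find_utf8_violations is made
up for illustration, and it simply leans on Python's strict UTF-8
decoder to report the byte offsets where decoding fails. In this
sketch, U+00F9 stored as C3 B9 passes, while a lone byte F9 (the
Latin-1 form) is reported as a violation.

```python
from typing import List, Tuple


def find_utf8_violations(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, offending_bytes) for each spot that is not valid UTF-8.

    Hypothetical helper for illustration only. It decodes the remaining
    input strictly, records where the decoder gives up, then resumes
    right after the offending bytes.
    """
    violations = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")  # strict decoding raises on bad input
            break                       # the rest of the data is clean
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded
            start = pos + err.start
            end = pos + err.end
            violations.append((start, data[start:end]))
            pos = end                   # continue after the bad bytes
    return violations


if __name__ == "__main__":
    good = "abc \u00f9 def".encode("utf-8")  # U+00F9 stored as C3 B9
    bad = b"abc \xf9 def"                    # lone Latin-1 byte F9
    print(find_utf8_violations(good))
    print(find_utf8_violations(bad))
```

The offsets are byte positions in the file, not character positions in
an Emacs buffer, so mapping them back to a buffer position is left
open here.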

There is something like this already implemented in PostScript
printing: when the buffer contains characters outside a specific ISO
Latin encoding, up to a dozen of them are presented in a warning
buffer.

--
Greetings

  Pete

<\ \__ O __O | O\ _\\/\-% _`\<, '()-'-(_)--(_) (_)/(_)