From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: harven <harven@free.fr>
Newsgroups: gmane.emacs.help
Subject: Re: Finding and mapping all UTF-8 characters
Date: Sat, 05 Dec 2009 21:29:47 +0100
Organization: http://groups.google.com
Message-ID: <m2ocmdb1vo.fsf@free.fr>
References: <da535e83-3eec-429c-b63e-f304cc1f2dd3@n13g2000vbe.googlegroups.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1260045689 374 80.91.229.12 (5 Dec 2009 20:41:29 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 5 Dec 2009 20:41:29 +0000 (UTC)
To: help-gnu-emacs@gnu.org
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Sat Dec 05 21:41:22 2009
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1NH1RQ-0007qV-JJ
	for geh-help-gnu-emacs@m.gmane.org; Sat, 05 Dec 2009 21:41:20 +0100
Original-Received: from localhost ([127.0.0.1]:48038 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1NH1RQ-0006GJ-E2
	for geh-help-gnu-emacs@m.gmane.org; Sat, 05 Dec 2009 15:41:20 -0500
Original-Path: news.stanford.edu!usenet.stanford.edu!news.tele.dk!news.tele.dk!small.news.tele.dk!tiscali!newsfeed1.ip.tiscali.net!proxad.net!feeder1-2.proxad.net!cleanfeed3-a.proxad.net!nnrp15-2.free.fr!not-for-mail
Original-Newsgroups: gnu.emacs.help
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (darwin)
Cancel-Lock: sha1:/BdsqdE//ENOgnyLt4wZIkDxiog=
Original-Lines: 35
Original-NNTP-Posting-Date: 05 Dec 2009 21:29:49 MET
Original-NNTP-Posting-Host: 78.233.232.132
Original-X-Trace: 1260044989 news-3.free.fr 11222 78.233.232.132:51286
Original-X-Complaints-To: abuse@proxad.net
Original-Xref: news.stanford.edu gnu.emacs.help:175371
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/help-gnu-emacs>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.help:70447
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/70447>

deech <aditya.siram@gmail.com> writes:

> Hi all,
> I recently cut-and-pasted large chunks of text into an HTML document.
> When I tried to save the document I was warned that it was ISO-Latin
> but there were UTF-8 characters in the text.

The warning actually contains a list of these characters, and you can click
on them to see where they are located in the buffer.
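
If you would rather get at the same information from Lisp, the built-in
function unencodable-char-position reports where such characters are. Here is
a minimal sketch; the name my-unencodable-positions is made up for this
example and is not part of Emacs.

```elisp
(defun my-unencodable-positions (coding-system)
  "Return the buffer positions of characters CODING-SYSTEM cannot encode."
  ;; With a COUNT argument, `unencodable-char-position' returns a list
  ;; of positions instead of only the first one.  Using the buffer size
  ;; as COUNT asks for every occurrence.
  (unencodable-char-position (point-min) (point-max)
                             coding-system
                             (max 1 (buffer-size))))
```

For instance, (goto-char (car (my-unencodable-positions 'iso-latin-1)))
should jump to the first offending character, provided there is one.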

> Is there a way to (1) search for the UTF-8 encoded characters in a
> document and (2) map them to a sensible ASCII character?
>
> Thanks ...
> -deech

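Instead of converting the text to latin-1, it is probably better to save the
file in a coding system that can represent every character, such as UTF-8.
Just do
M-x set-buffer-file-coding-system RET utf-8 RET
(also bound to C-x RET f) and save the buffer again.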

On the other hand, if the Unicode characters took you by surprise,
there are probably only a few of them. Have a look at the iso-cvt.el
package to see how a conversion table is set up.
The command iso-sgml2iso is pretty close to what you want.

Now, if you want to search a buffer for all characters belonging to
some category, you can use a regexp.

\ca matches any ASCII character (newlines excluded). Close to [[:ascii:]],
    which also matches newlines and other control characters.
\Ca matches any non-ASCII character (newlines included).
\cl matches any Latin character (newlines excluded).
\Cl matches any non-Latin character (newlines included).
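
The same category regexps work from Lisp. As a rough sketch, the function
below walks the buffer with \Cl and tallies what it finds. The name
my-non-latin-chars is invented for this example, and skipping everything in
the ASCII range is my own choice, made so that newlines do not clutter the
result.

```elisp
(defun my-non-latin-chars ()
  "Return an alist of (CHAR . COUNT) for non-Latin characters in the buffer."
  (let (counts)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward "\\Cl" nil t)
        (let ((ch (char-before)))
          ;; \Cl also matches newlines and other control characters,
          ;; so ignore anything in the ASCII range.
          (when (> ch 127)
            (let ((cell (assq ch counts)))
              (if cell
                  (setcdr cell (1+ (cdr cell)))
                (push (cons ch 1) counts)))))))
    ;; Report characters in order of first appearance.
    (nreverse counts)))
```

Evaluate (my-non-latin-chars) with M-: in the buffer you want to inspect.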

So the following command copies all non-Latin characters to the *scratch* buffer:
M-x replace-regexp RET \Cl RET \,(princ \& (get-buffer "*scratch*")) RET
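
For part (2) of the question, mapping characters to something sensible in
ASCII, a small table plus a search loop may be enough. This is only a sketch:
my-ascii-replacements and my-asciify-buffer are hypothetical names, and the
entries in the table are my guess at what typically comes along with pasted
text, so adjust them to whatever actually turns up in your buffer.

```elisp
(defvar my-ascii-replacements
  '((#x2018 . "'") (#x2019 . "'")     ; single curly quotes
    (#x201c . "\"") (#x201d . "\"")   ; double curly quotes
    (#x2013 . "-") (#x2014 . "--")    ; en dash and em dash
    (#x2026 . "..."))                 ; horizontal ellipsis
  "Hypothetical alist mapping character codes to ASCII replacement strings.")

(defun my-asciify-buffer ()
  "Replace characters listed in `my-ascii-replacements'.
Return the number of replacements made.  Characters that are not in
the table are left alone."
  (interactive)
  (let ((n 0))
    (save-excursion
      (goto-char (point-min))
      ;; [[:nonascii:]] matches any character outside the ASCII range.
      (while (re-search-forward "[[:nonascii:]]" nil t)
        (let ((ascii (cdr (assq (char-before) my-ascii-replacements))))
          (when ascii
            ;; FIXEDCASE and LITERAL: insert the string exactly as given.
            (replace-match ascii t t)
            (setq n (1+ n))))))
    n))
```

Anything missing from the table stays in the buffer, so the save-time warning
will still point you at the characters that remain.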