* Finding and mapping all UTF-8 characters
From: deech @ 2009-12-05 16:03 UTC (permalink / raw)
To: help-gnu-emacs
Hi all,
I recently cut-and-pasted large chunks of text into an HTML document.
When I tried to save the document I was warned that it was ISO-Latin
but there were UTF-8 characters in the text.
Is there a way to (1) search for the UTF-8 encoded characters in a
document and (2) map them to a sensible ASCII character?
Thanks ...
-deech
* Re: Finding and mapping all UTF-8 characters
From: Pascal J. Bourguignon @ 2009-12-05 16:38 UTC (permalink / raw)
To: help-gnu-emacs
deech <aditya.siram@gmail.com> writes:
> Hi all,
> I recently cut-and-pasted large chunks of text into an HTML document.
> When I tried to save the document I was warned that it was ISO-Latin
> but there were UTF-8 characters in the text.
I doubt it warned that.
ISO-Latin is not a character encoding, it is a family of character
encodings. An HTML document is not encoded by a family of encodings,
but by one single encoding.
UTF-8 is a character encoding. A character is not a character
encoding.
So a sentence saying that "a document is ISO-Latin but there are
UTF-8 characters in the text" is totally meaningless.
> Is there a way to (1) search for the UTF-8 encoded characters in a
> document and
No, it is not possible, because characters in a document are not
encoded; they are just characters, that's all.
> (2) map them to a sensible ASCII character?
How do you sensibly map ∈, ㎲, 纺, or ⇣ to characters in the ASCII
character set?
But even if you chose a mapping (you could for example map the
characters to their names: ELEMENT_OF, SQUARE_MU_S, U7EBA, and
DOWNWARDS_DASHED_ARROW), why would you want to do such a thing?
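(For illustration, a minimal Emacs Lisp sketch of such a name
mapping, assuming Emacs 23's Unicode character properties; the
helper name and the U+XXXX fallback are invented for the example:

  ;; Replace every non-ASCII character in the buffer with its
  ;; Unicode name, falling back to a U+XXXX code when no name
  ;; is available.
  (defun my-char-to-name (ch)           ; hypothetical helper
    (or (get-char-code-property ch 'name)
        (format "U+%04X" ch)))

  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward "[^[:ascii:]]" nil t)
      (replace-match (my-char-to-name
                      (string-to-char (match-string 0)))
                     t t)))
)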
HTML is perfectly able to use encodings that can encode Unicode
characters, and all current browsers are able to deal with HTML
documents encoding Unicode characters, so why would you want to
massacre your document?
(There's a valid reason to want to do that, but if you don't
know it, then you don't have it.)
--
__Pascal Bourguignon__
* Re: Finding and mapping all UTF-8 characters
From: Peter Dyballa @ 2009-12-05 18:40 UTC (permalink / raw)
To: deech; +Cc: help-gnu-emacs
On 05.12.2009 at 17:03, deech wrote:
> Is there a way to (1) search for the UTF-8 encoded characters in a
> document
Yes. In GNU Emacs 23 I've seen hyperlinks in the *Warnings* buffer to
the characters that don't fit into the specified encoding.
You could also search for the usual prefixes of UTF-{7,8,16} encoded
characters.
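Note that inside Emacs the buffer already holds decoded characters,
so in a normally visited file you can instead just search for
anything outside ASCII, e.g.

M-x occur RET [^[:ascii:]] RET

or, from Lisp:

  ;; Move point to the next character outside the ASCII range,
  ;; returning nil if there is none.
  (re-search-forward "[^[:ascii:]]" nil t)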
> and (2) map them to a sensible ASCII character?
How can you map 100,000 or 200,000 characters to a very limited set of
100? This mapping would be a candidate for the most successful
compression algorithm...
Besides, it's not sane to save a file in one encoding when the file's
header says its contents are in another.
--
Greetings
Pete
If you don't find it in the index, look very carefully through the
entire catalogue.
– Sears, Roebuck, and Co., Consumer's Guide, 1897
* Re: Finding and mapping all UTF-8 characters
From: harven @ 2009-12-05 20:29 UTC (permalink / raw)
To: help-gnu-emacs
deech <aditya.siram@gmail.com> writes:
> Hi all,
> I recently cut-and-pasted large chunks of text into an HTML document.
> When I tried to save the document I was warned that it was ISO-Latin
> but there were UTF-8 characters in the text.
The warning actually contains a list of these characters, and you can click
on them to see where they are located in the buffer.
> Is there a way to (1) search for the UTF-8 encoded characters in a
> document and (2) map them to a sensible ASCII character?
>
> Thanks ...
> -deech
Instead of converting to latin-1, it is probably better to save the file
in another coding system. Just do
M-x set-buffer-file-coding-system RET utf-8 RET
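From Lisp, the equivalent should be just:

  ;; Mark the buffer to be saved as UTF-8 the next time it is written.
  (set-buffer-file-coding-system 'utf-8)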
On the other hand, if you were surprised by the Unicode characters,
then there are probably only a few of them. Have a look at
the iso-cvt.el package for setting a conversion table.
The command iso-iso2sgml (which rewrites Latin-1 characters as SGML
entities; iso-sgml2iso goes the other way) is pretty close to what
you want.
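For example, something like this should rewrite a whole buffer
(assuming its non-ASCII characters are within Latin-1):

  ;; Load the iso-cvt conversion tables and rewrite the Latin-1
  ;; characters in the buffer as SGML/HTML entities.
  (require 'iso-cvt)
  (iso-iso2sgml (point-min) (point-max))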
Now, if you want to search a buffer for all characters belonging to
some category, you can use a regexp.
\ca matches any ASCII character (newlines excluded); same as [[:ascii:]].
\Ca matches any non-ASCII character (newlines included).
\cl matches any Latin character (newlines excluded).
\Cl matches any non-Latin character (newlines included).
So the following command copies all non-Latin characters to the scratch buffer:
M-x replace-regexp RET \Cl RET \,(princ \& (get-buffer "*scratch*")) RET
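A pure-Lisp sketch of the same idea, collecting the characters into a
list instead (it uses [[:ascii:]] rather than the category regexp, so
it picks up everything non-ASCII):

  ;; Collect one copy of each non-ASCII character in the buffer.
  (let (chars)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward "[^[:ascii:]]" nil t)
        (push (string-to-char (match-string 0)) chars)))
    (delete-dups (nreverse chars)))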