how to find encoding violations in Emacs buffer?

unofficial mirror of help-gnu-emacs@gnu.org

* how to find encoding violations in Emacs buffer?
@ 2006-12-12 18:18 riccardo.murri
  2006-12-12 20:39 ` Lennart Borgman
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: riccardo.murri @ 2006-12-12 18:18 UTC


Hello,

from time to time, a buffer gets some spurious character in and Emacs
refuses to save it in the correct encoding. So I am presented with the
choice of other different encodings.

However, in most of the cases, I know that the file *should* be UTF-8
encoded.  So I would rather like to find out where the offending
character is and correct it, instead of choosing a different encoding.

Is there any function/package/elisp hack to find/highlight characters
in a buffer that Emacs could not encode as UTF-8?

Thank you for any hint!

Riccardo

P.S. Currently running 22.0.90


* Re: how to find encoding violations in Emacs buffer?
  2006-12-12 18:18 how to find encoding violations in Emacs buffer? riccardo.murri
@ 2006-12-12 20:39 ` Lennart Borgman
  2006-12-12 20:56   ` Riccardo Murri
       [not found] ` <mailman.1801.1165956001.2155.help-gnu-emacs@gnu.org>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Lennart Borgman @ 2006-12-12 20:39 UTC
  Cc: help-gnu-emacs

riccardo.murri@gmail.com wrote:
> Hello,
>
> from time to time, a buffer gets some spurious character in and Emacs
> refuses to save it in the correct encoding. So I am presented with the
> choice of other different encodings.
>
> However, in most of the cases, I know that the file *should* be UTF-8
> encoded.  So I would rather like to find out where the offending
> character is and correct it, instead of choosing a different encoding.
>
> Is there any function/package/elisp hack to find/highlight characters
> in a buffer that Emacs could not encode as UTF-8?
>
> Thank you for any hint!
>
> Riccardo
>
> P.S. Currently running 22.0.90
>   

I think someone said there was, but I have never seen it, though I have
had this problem quite often. I can't remember the details now.

Which platform are you on? I am using MS Windows (2000 or XP).


* Re: how to find encoding violations in Emacs buffer?
  2006-12-12 20:39 ` Lennart Borgman
@ 2006-12-12 20:56   ` Riccardo Murri
  0 siblings, 0 replies; 9+ messages in thread
From: Riccardo Murri @ 2006-12-12 20:56 UTC
  Cc: help-gnu-emacs

On 12/12/06, Lennart Borgman <lennart.borgman.073@student.lu.se> wrote:
> riccardo.murri@gmail.com wrote:
> > Hello,
> >
> > from time to time, a buffer gets some spurious character in and Emacs
> > refuses to save it in the correct encoding. So I am presented with the
> > choice of other different encodings.
> >
> > However, in most of the cases, I know that the file *should* be UTF-8
> > encoded.  So I would rather like to find out where the offending
> > character is and correct it, instead of choosing a different encoding.
> >
> > Is there any function/package/elisp hack to find/highlight characters
> > in a buffer that Emacs could not encode as UTF-8?
> >
>
> I think someone said there was, but I have never seen it though I have
> had these problem quite often. Can't remember the details now.
>
> Which platform are you on? I am using MS Windows (2000 or XP).
>

Debian (etch) GNU/Linux, running emacs-snapshot-gtk, 22.0.90

Does the platform matter in this case?  I would have thought the elisp
interface hides those details...

Riccardo


* Re: how to find encoding violations in Emacs buffer?
       [not found] ` <mailman.1801.1165956001.2155.help-gnu-emacs@gnu.org>
@ 2006-12-12 23:45   ` B. T. Raven
  0 siblings, 0 replies; 9+ messages in thread
From: B. T. Raven @ 2006-12-12 23:45 UTC

"Lennart Borgman" <lennart.borgman.073@student.lu.se> wrote in message
news:mailman.1801.1165956001.2155.help-gnu-emacs@gnu.org...
> riccardo.murri@gmail.com wrote:
> > Hello,
> >
> > from time to time, a buffer gets some spurious character in and Emacs
> > refuses to save it in the correct encoding. So I am presented with the
> > choice of other different encodings.
> >
> > However, in most of the cases, I know that the file *should* be UTF-8
> > encoded.  So I would rather like to find out where the offending
> > character is and correct it, instead of choosing a different encoding.
> >
> > Is there any function/package/elisp hack to find/highlight characters
> > in a buffer that Emacs could not encode as UTF-8?
> >
> > Thank you for any hint!
> >
> > Riccardo
> >
> > P.S. Currently running 22.0.90
> >
>
> I think someone said there was, but I have never seen it though I have
> had these problem quite often. Can't remember the details now.
>
> Which platform are you on? I am using MS Windows (2000 or XP).
>
>

I also use utf-8 almost all the time (with the dos coding system for text
files to be used in the w32 environment, e.g. batch files). For files
written by and read back into w32 Emacs, I think that it's important to
distinguish between two cases: characters that are not displayed due to
the lack of a glyph in the font(s) [shown as hollow rectangles, solid
rhombs, or question marks], and characters that are missing because of
some incompatibility between character mappings [I think these are shown
as escaped octal sequences that result from differences between national
character sets in the range just above ASCII (128-255)]. I have always
been able to fix these with M-% (query-replace). If you can find tables
of these extended characters for the various European languages, it
should be fairly easy for someone (not me) to cobble together an elisp
routine to deal with this problem. I would guess that 10 or 20 of these
characters account for more than 99% of the mismappings.

Ed
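
A possible shortcut for the mismapping case, sketched in Python rather
than elisp: instead of a hand-built table, decode the data as UTF-8 and
fall back to Latin-1 only for the bytes that are not valid UTF-8. This
assumes the stray bytes really are Latin-1 (ISO 8859-1), which the
thread does not establish in general. The handler name
"latin1_fallback" and the function names are made up for this
illustration.

```python
import codecs


def _latin1_fallback(err):
    """Decode only the offending bytes as Latin-1, then resume after them."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


# "latin1_fallback" is a hypothetical handler name chosen for this sketch.
codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_to_utf8(data: bytes) -> bytes:
    """Return data re-encoded as clean UTF-8.

    Assumption: every byte that is not part of a valid UTF-8 sequence
    is a Latin-1 character.
    """
    text = data.decode("utf-8", errors="latin1_fallback")
    return text.encode("utf-8")
```

One limitation of this approach: two Latin-1 bytes that happen to form
a valid UTF-8 sequence (for example 0xC3 followed by 0xA9) cannot be
told apart from real UTF-8 and are left unchanged.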


* Re: how to find encoding violations in Emacs buffer?
  2006-12-12 18:18 how to find encoding violations in Emacs buffer? riccardo.murri
  2006-12-12 20:39 ` Lennart Borgman
       [not found] ` <mailman.1801.1165956001.2155.help-gnu-emacs@gnu.org>
@ 2006-12-13  4:26 ` Eli Zaretskii
       [not found] ` <mailman.1814.1165984004.2155.help-gnu-emacs@gnu.org>
  3 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2006-12-13  4:26 UTC


> From: riccardo.murri@gmail.com
> Date: 12 Dec 2006 10:18:13 -0800
> 
> from time to time, a buffer gets some spurious character in and Emacs
> refuses to save it in the correct encoding. So I am presented with the
> choice of other different encodings.
> 
> However, in most of the cases, I know that the file *should* be UTF-8
> encoded.  So I would rather like to find out where the offending
> character is and correct it, instead of choosing a different encoding.
> 
> Is there any function/package/elisp hack to find/highlight characters
> in a buffer that Emacs could not encode as UTF-8?

Emacs 22 already shows the problematic characters.  Please look closer
at the text of the buffer where Emacs tells you why it needs your
decision about the encoding.


* Re: how to find encoding violations in Emacs buffer?
       [not found] ` <mailman.1814.1165984004.2155.help-gnu-emacs@gnu.org>
@ 2006-12-13  8:39   ` riccardo.murri
  2006-12-13 10:45     ` Peter Dyballa
       [not found]     ` <mailman.1823.1166006732.2155.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 9+ messages in thread
From: riccardo.murri @ 2006-12-13  8:39 UTC

On Dec 13, 5:26 am, Eli Zaretskii <e... -at- gnu.org> wrote:
> > From: riccardo.mu... -at- gmail.com
> > Date: 12 Dec 2006 10:18:13 -0800
>
> > from time to time, a buffer gets some spurious character in and Emacs
> > refuses to save it in the correct encoding. So I am presented with the
> > choice of other different encodings.
>
> > However, in most of the cases, I know that the file *should* be UTF-8
> > encoded.  So I would rather like to find out where the offending
> > character is and correct it, instead of choosing a different encoding.
>
> > Is there any function/package/elisp hack to find/highlight characters
> > in a buffer that Emacs could not encode as UTF-8?
>
> Emacs 22 already shows the problematic characters.  Please look closer
> at the text of the buffer where Emacs tells you why it needs your
> decision about the encoding.

Yes, but it may be hard to spot one single problematic character in a
large buffer.  In the case at hand, I had one Latin-1 "ù" in a 20k
UTF-8 text, and, since the encoding was thus incorrect and could not
be autodetected, Emacs displayed all non-ASCII characters as \xxx
escape sequences...

Isn't there a way to implement a "goto-next-problematic-char" elisp
function?  UTF-8 has a rather simple algorithm to detect encoding
violations, which can point at the precise point where a byte sequence
violates UTF-8 rules, but I wondered if Emacs had a more general
interface: if it knows where in the buffer the encoding violations
are located, one would assume that this information would be available
at elisp level.

Riccardo
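
To illustrate the point about UTF-8 violations being locatable, here is
a minimal sketch in Python (not elisp, and not an Emacs interface). It
relies on the decoder's own error report to get the byte offset of each
invalid span; the function name find_utf8_violations is invented for
this example.

```python
def find_utf8_violations(data: bytes):
    """Return a list of (offset, bad_bytes) for each span that is not
    valid UTF-8, in order of appearance."""
    violations = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no further violations.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we passed.
            start = pos + err.start
            end = pos + err.end
            violations.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad span
    return violations
```

The offsets are byte positions in the file, not character positions in
an Emacs buffer, so mapping them to buffer positions would be a
separate step. Re-slicing the remainder on every error keeps the sketch
short but is wasteful for input with many violations.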


* Re: how to find encoding violations in Emacs buffer?
  2006-12-13  8:39   ` riccardo.murri
@ 2006-12-13 10:45     ` Peter Dyballa
       [not found]     ` <mailman.1823.1166006732.2155.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Dyballa @ 2006-12-13 10:45 UTC
  Cc: help-gnu-emacs


On 13.12.2006 at 09:39, riccardo.murri@gmail.com wrote:

> Yes, but it may be hard to spot one single problematic character in a
> large buffer.  In the case at hand, I had one Latin-1 "ù" in a 20k
> UTF-8 text,

This character is an UTF-8 entity:

	[ù]  00F9  LATIN SMALL LETTER U WITH GRAVE

It cannot be the cause. In UTF-8 it's encoded as C3 B9. Kermit and  
Unicode Emacs 23 have a file UnicodeData.txt that describes a lot of  
Unicode characters. A bit more complete is Kermit's utf8.txt from  
which the above excerpt comes.

>
> Isn't there a way to implement a "goto-next-problematic-char" elisp
> function?  UTF-8 has a rather simple algorithm to detect encoding
> violations, which can point at the precise point where a byte sequence
> violates UTF-8 rules, but I wondered if Emacs had a more general
> interface: if it knows where in the buffer the encoding violations
> are located, one would assume that this information would be available
> at elisp level.

There is something like this already implemented in PostScript
printing: when the buffer contains characters outside a specific ISO
Latin encoding, up to a dozen of them are presented in a warning buffer.

--
Greetings

   Pete
               <\
                 \__     O                       __O
                 | O\   _\\/\-%                _`\<,
                 '()-'-(_)--(_)               (_)/(_)


* Re: how to find encoding violations in Emacs buffer?
       [not found]     ` <mailman.1823.1166006732.2155.help-gnu-emacs@gnu.org>
@ 2006-12-13 12:34       ` riccardo.murri
  2006-12-13 12:55         ` Peter Dyballa
  0 siblings, 1 reply; 9+ messages in thread
From: riccardo.murri @ 2006-12-13 12:34 UTC




On Dec 13, 11:45 am, Peter Dyballa <Peter_Dyba... -at- Web.DE> wrote:
> On 13.12.2006 at 09:39, riccardo.mu.. -at- gmail.com wrote:
>
> > Yes, but it may be hard to spot one single problematic character in a
> > large buffer.  In the case at hand, I had one Latin-1 "ù" in a 20k
> > UTF-8 text,
>
> This character is an UTF-8 entity:
>
>         [ù]  00F9  LATIN SMALL LETTER U WITH GRAVE
>
> It cannot be the cause. In UTF-8 it's encoded as C3 B9.

Yes, but the file had 0xF9 in it instead of 0xC3B9, which caused UTF-8
auto-detection to fail.

> > Isn't there a way to implement a "goto-next-problematic-char" elisp
> > function?  UTF-8 has a rather simple algorithm to detect encoding
> > violations, which can point at the precise point where a byte sequence
> > violates UTF-8 rules, but I wondered if Emacs had a more general
> > interface: if it knows where in the buffer the encoding violations
> > are located, one would assume that this information would be available
> > at elisp level.
>
> There is something like this already implemented in PostScript
> printing: when the buffer contains characters outside a specific ISO
> Latin encoding up to a dozen of them is presented in a warning buffer.
>

Thank you for the pointer!  I'll have a look at that code.

Greetings,
Riccardo
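
The byte values discussed above can be checked with a short Python
sketch (used here only as a neutral tool, not as anything Emacs does
internally). It shows that the same character is the single byte F9 in
Latin-1 and the two bytes C3 B9 in UTF-8, and that a lone F9 makes
otherwise valid UTF-8 data undecodable. The helper name is_valid_utf8
is invented for this example.

```python
ch = "\u00f9"  # LATIN SMALL LETTER U WITH GRAVE

latin1_bytes = ch.encode("latin-1")  # single byte 0xF9
utf8_bytes = ch.encode("utf-8")      # two bytes 0xC3 0xB9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word, once with the Latin-1 byte and once with the UTF-8 bytes.
broken = b"pi" + latin1_bytes
correct = b"pi" + utf8_bytes
```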


* Re: how to find encoding violations in Emacs buffer?
  2006-12-13 12:34       ` riccardo.murri
@ 2006-12-13 12:55         ` Peter Dyballa
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Dyballa @ 2006-12-13 12:55 UTC
  Cc: help-gnu-emacs

On 13.12.2006 at 13:34, riccardo.murri@gmail.com wrote:

> Yes, but the file had 0xF9 in it instead of 0xC3B9, which caused UTF-8
> auto-detection to fail.

I think you can choose 'For Reverting This File Now' (C-x RET r) from
the "Options -> Mule -> Set Coding Systems" menu to re-read the contents
of a non-UTF-8 file as UTF-8 ...

--
Greetings

   Pete

Ce qui été compris n'existe plus.    (Paul Eluard)
