how to scan file for non-ascii chars (eg cut-n-paste from ms-word)

* how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
@ 2011-01-09  0:53 David Combs
  2011-01-09 14:23 ` Eli Zaretskii
       [not found] ` <mailman.11.1294583034.18702.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 12+ messages in thread
From: David Combs @ 2011-01-09  0:53 UTC (permalink / raw)
  To: help-gnu-emacs

When I 'cut-n-paste' from eg ms-word-produced document, into an
emacs buffer (ie ascii), you get all kinds of "non-ascii" chars,
eg left and right double-quotes, like these:

Char: . (8221, #o20035, #x201d) point=250 of 4096 (6%) column=7
Char: . (8220, #o20034, #x201c) point=218 of 4096 (5%) column=42

accents, and so on.

When I go to save the buffer, emacs will ask if I want to
save it in eg japanese format.  Not exactly what I want.

What I'd like to do is change those "strange" characters
to their plain-ascii "equivalent", so to speak.  Like
'"' for double quote (left OR right), etc.

Surely I'm not the only one to experience this difficulty:
what work-arounds have YOU found?

Thanks!

David


* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
  2011-01-09  0:53 how to scan file for non-ascii chars (eg cut-n-paste from ms-word) David Combs
@ 2011-01-09 14:23 ` Eli Zaretskii
  2011-01-09 16:24   ` Kenneth Goldman
       [not found] ` <mailman.11.1294583034.18702.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2011-01-09 14:23 UTC (permalink / raw)
  To: help-gnu-emacs

> From: dkcombs@panix.com (David Combs)
> Newsgroups: gnu.emacs.help
> Date: 8 Jan 2011 19:53:01 -0500
> 
> When I 'cut-n-paste' from eg ms-word-produced document, into an
> emacs buffer (ie ascii), you get all kinds of "non-ascii" chars,
> eg left and right double-quotes, like these:
> 
> 
> Char: . (8221, #o20035, #x201d) point=250 of 4096 (6%) column=7
> Char: . (8220, #o20034, #x201c) point=218 of 4096 (5%) column=42
> 
> 
> accents, and so on.
> 
> When I go to save the buffer, emacs will ask if I want to
> save it in eg japanese format.  Not exactly what I want.

Doesn't it suggest utf-8 as one of the possible encodings?  If so, why
not use utf-8 and leave these characters in the file?

> What I'd like to do is change those "strange" characters
> to their plain-ascii "equivalent", so to speak.  Like
> '"' for double quote (left OR right), etc.

Not sure why would you want that, but doesn't M-% solve this problem
nicely?  If not, why not?




* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
  2011-01-09 14:23 ` Eli Zaretskii
@ 2011-01-09 16:24   ` Kenneth Goldman
  2011-01-09 17:30     ` Eli Zaretskii
       [not found]     ` <mailman.5.1294594249.11727.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Kenneth Goldman @ 2011-01-09 16:24 UTC (permalink / raw)
  Cc: help-gnu-emacs


> From: Eli Zaretskii <eliz@gnu.org>
> To: help-gnu-emacs@gnu.org
> Date: 01/09/2011 09:24 AM
> Subject: Re: how to scan file for non-ascii chars (eg cut-n-paste 
> from ms-word)
> Sent by: help-gnu-emacs-bounces+kgold=watson.ibm.com@gnu.org
> 
> > From: dkcombs@panix.com (David Combs)
> > Newsgroups: gnu.emacs.help
> > Date: 8 Jan 2011 19:53:01 -0500
> > 
> > When I 'cut-n-paste' from eg ms-word-produced document, into an
> > emacs buffer (ie ascii), you get all kinds of "non-ascii" chars,
> > eg left and right double-quotes, like these:
> > 
> > 
> > Char: . (8221, #o20035, #x201d) point=250 of 4096 (6%) column=7
> > Char: . (8220, #o20034, #x201c) point=218 of 4096 (5%) column=42
> > 
> > 
> > accents, and so on.
> > 
> > When I go to save the buffer, emacs will ask if I want to
> > save it in eg japanese format.  Not exactly what I want.
> 
> Doesn't it suggest utf-8 as one of the possible encodings?  If so, why
> not use utf-8 and leave these characters in the file?
> 

I've seen the same issue.  If I'm writing source code, I want plain ASCII,
nothing unusual that a compiler or linker might complain about.

> > What I'd like to do is change those "strange" characters
> > to their plain-ascii "equivalent", so to speak.  Like
> > '"' for double quote (left OR right), etc.
> 
> Not sure why would you want that, but doesn't M-% solve this problem
> nicely?  If not, why not?

query-replace works once one has found the non-ASCII character.  However,
it's often not obvious where the offending text is.

Is there a way to search for anything that isn't ASCII?



* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
  2011-01-09 16:24   ` Kenneth Goldman
@ 2011-01-09 17:30     ` Eli Zaretskii
  2011-01-09 17:53       ` how to scan file for non-ascii chars (eg cut-n-paste " Drew Adams
       [not found]       ` <mailman.8.1294595713.11727.help-gnu-emacs@gnu.org>
       [not found]     ` <mailman.5.1294594249.11727.help-gnu-emacs@gnu.org>
  1 sibling, 2 replies; 12+ messages in thread
From: Eli Zaretskii @ 2011-01-09 17:30 UTC (permalink / raw)
  To: help-gnu-emacs

> Cc: help-gnu-emacs@gnu.org
> From: Kenneth Goldman <kgoldman@us.ibm.com>
> Date: Sun, 9 Jan 2011 11:24:01 -0500
> 
> query-replace works once one has found the non-ASCII character.  However,
> it's often not obvious where the offending text is.

When Emacs asks you to select a suitable encoding, it highlights these
characters, so you can see where they are in the buffer.

> Is there a way to search for anything that isn't ASCII?

If the above is not enough, then try

   M-: (skip-chars-forward "\000-\377") RET
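
A variation on the same idea, as a minimal sketch: since `skip-chars-forward'
accepts character classes, the class "[:ascii:]" can be used so that point
stops at the first character above code 127 (the octal range above also skips
the characters between 128 and 255).  The command name `my-next-non-ascii' is
made up for this example and is not part of Emacs.

```elisp
(defun my-next-non-ascii ()
  "Move point to the next non-ASCII character, if there is one.
Return the character found, or nil when only ASCII remains.
If point is already on a non-ASCII character, the search starts
after it, so that repeated calls keep advancing."
  (interactive)
  ;; Step off a non-ASCII character we are already sitting on.
  (when (and (not (eobp)) (> (char-after) 127))
    (forward-char 1))
  ;; "[:ascii:]" is the character class for codes 0 through 127.
  (skip-chars-forward "[:ascii:]")
  (if (eobp)
      (progn (message "No further non-ASCII characters") nil)
    (char-after)))
```

After evaluating the definition, `M-x my-next-non-ascii' can be repeated (or
bound to a key) to hop from one offending character to the next.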




* RE: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
  2011-01-09 17:30     ` Eli Zaretskii
@ 2011-01-09 17:53       ` Drew Adams
       [not found]       ` <mailman.8.1294595713.11727.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 12+ messages in thread
From: Drew Adams @ 2011-01-09 17:53 UTC (permalink / raw)
  To: 'Eli Zaretskii', help-gnu-emacs

> > query-replace works once one has found the non-ASCII 
> > character.  However, it's often not obvious where the
> > offending text is.
> 
> When Emacs asks you to select a suitable encoding, it highlights these
> characters, so you can see where they are in the buffer.
> 
> > Is there a way to search for anything that isn't ASCII?
> 
> If the above is not enough, then try
>    M-: (skip-chars-forward "\000-\377") RET

Better: you can easily search incrementally for non-ascii chars, starting with
Emacs 22, like this:  C-M-s [^[:ascii:]]

I just filed an Emacs bug (#7809) to get this info added to the Emacs manual.
It was added to the Elisp manual (node `Regexp Special') in Emacs 23, but
apparently it wasn't considered useful info for users. ;-)  (Probably just an
oversight, actually.)
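
The same regexp also works from Lisp.  As a small sketch (the function name
`my-count-non-ascii' is invented for this example), the standard function
`how-many' can report how many non-ascii chars a buffer contains before you
start searching for them:

```elisp
(defun my-count-non-ascii ()
  "Return the number of non-ASCII characters in the current buffer."
  (interactive)
  ;; `how-many' counts the matches of a regexp between two positions.
  (let ((n (how-many "[^[:ascii:]]" (point-min) (point-max))))
    (when (called-interactively-p 'interactive)
      (message "%d non-ASCII character(s)" n))
    n))
```

A result of 0 means the buffer can be saved as plain ASCII without Emacs
asking about coding systems.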

---

If you use Icicles, you can also see and search for all
sequences of non-ascii chars this way:  C-` [^[:ascii:]]+
`S-TAB' to see hits, `C-next' to visit them, etc.
http://www.emacswiki.org/emacs/Icicles_-_Search_Commands%2c_Overview

When the set of hits is thus those defined by [^[:ascii:]]+, you can type any
string using a subset of those chars (i.e., one or more particular non-ascii
chars) to narrow the hits, then visit any of those, and optionally replace any
or all of them with alternatives.
http://www.emacswiki.org/emacs/Icicles_-_Search-And-Replace


* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
       [not found] ` <mailman.11.1294583034.18702.help-gnu-emacs@gnu.org>
@ 2011-01-18 19:54   ` David Combs
  2011-01-18 21:32     ` harven
  2011-01-19  4:38     ` Teemu Likonen
  0 siblings, 2 replies; 12+ messages in thread
From: David Combs @ 2011-01-18 19:54 UTC (permalink / raw)
  To: help-gnu-emacs

In article <mailman.11.1294583034.18702.help-gnu-emacs@gnu.org>,
Eli Zaretskii  <eliz@gnu.org> wrote:
>> From: dkcombs@panix.com (David Combs)
>> Newsgroups: gnu.emacs.help
>> Date: 8 Jan 2011 19:53:01 -0500
>> 
>> When I 'cut-n-paste' from eg ms-word-produced document, into an
>> emacs buffer (ie ascii), you get all kinds of "non-ascii" chars,
>> eg left and right double-quotes, like these:
>> 
>> 
>> Char: . (8221, #o20035, #x201d) point=250 of 4096 (6%) column=7
>> Char: . (8220, #o20034, #x201c) point=218 of 4096 (5%) column=42
>> 
>> 
>> accents, and so on.
>> 
>> When I go to save the buffer, emacs will ask if I want to
>> save it in eg japanese format.  Not exactly what I want.
>
>Doesn't it suggest utf-8 as one of the possible encodings?  If so, why
>not use utf-8 and leave these characters in the file?

Because (er, as an excuse) I often want to copy-paste them into
an ASCII hints-and-tricks file I keep for my own use, and
which I then edit and search-within via emacs (of course).

Suppose I want to PRINT from that supposedly-ASCII file on my
old (but wonderful) HP-1200 laserjet: all it has for fonts
are the original times, some sans-serif one, something else
(I forget), and "symbol".  Isn't that a problem?

FURTHER, and more importantly, how do I *search* for
one of these funny things, a left-double-quote, say?
It's so *easy* to just hit C-s "!

Given my current state of emacs-knowledge on "foreign"
fonts (like zero), that's what I say -- until I can
somehow learn more.

Thanks!

>
>> What I'd like to do is change those "strange" characters
>> to their plain-ascii "equivalent", so to speak.  Like
>> '"' for double quote (left OR right), etc.
>
>Not sure why would you want that, but doesn't M-% solve this problem
>nicely?  If not, why not?
>

You mean do a query-replace on each non-ascii char?  How do I 
even know which ones are even *in* some buffer of text?

What'd be nice is something that went through the whole
buffer *once*, doing the "right thing" with each
non-ascii char.

Do I make any sense?  Or do I not really understand?

Thanks,

David


* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
       [not found]     ` <mailman.5.1294594249.11727.help-gnu-emacs@gnu.org>
@ 2011-01-18 20:06       ` David Combs
  0 siblings, 0 replies; 12+ messages in thread
From: David Combs @ 2011-01-18 20:06 UTC (permalink / raw)
  To: help-gnu-emacs

In article <mailman.5.1294594249.11727.help-gnu-emacs@gnu.org>,
Eli Zaretskii  <eliz@gnu.org> wrote:
>> Cc: help-gnu-emacs@gnu.org
>> From: Kenneth Goldman <kgoldman@us.ibm.com>
>> Date: Sun, 9 Jan 2011 11:24:01 -0500
>> 
>> query-replace works once one has found the non-ASCII character.  However,
>> it's often not obvious where the offending text is.
>
>When Emacs asks you to select a suitable encoding, it highlights these
>characters, so you can see where they are in the buffer.
>
>> Is there a way to search for anything that isn't ASCII?
>
>If the above is not enough, then try
>
>   M-: (skip-chars-forward "\000-\377") RET
>

Nice, that skip-chars-forward:

| skip-chars-forward is a built-in function in `C source code'.
| (skip-chars-forward string &optional lim)
| 
| Move point forward, stopping before a char not in string, or at pos lim.
| string is like the inside of a `[...]' in a regular expression
| except that `]' is never special and `\' quotes `^', `-' or `\'
|  (but not as the end of a range; quoting is never needed there).
| Thus, with arg "a-zA-Z", this skips letters stopping before first nonletter.
| With arg "^a-zA-Z", skips nonletters stopping before first letter.
| Char classes, e.g. `[:alpha:]', are supported.
| 
| Returns the distance traveled, either zero or positive.

But you're still down to doing them *one at a time*!

(If there's 200 left-right-quote pairs in a buffer, plus
who knows what else there is, isn't it sort of a pain
to use that as a way to find out what the non-ascii
chars *are*?)

David


* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
       [not found]       ` <mailman.8.1294595713.11727.help-gnu-emacs@gnu.org>
@ 2011-01-18 20:19         ` David Combs
  2011-01-19  2:02           ` how to scan file for non-ascii chars (eg cut-n-paste " Drew Adams
  0 siblings, 1 reply; 12+ messages in thread
From: David Combs @ 2011-01-18 20:19 UTC (permalink / raw)
  To: help-gnu-emacs

In article <mailman.8.1294595713.11727.help-gnu-emacs@gnu.org>,
Drew Adams <drew.adams@oracle.com> wrote:
...
...

>
>---
>
>If you use Icicles, you can also see and search for all
>sequences of non-ascii chars this way:  C-` [^[:ascii:]]+
>`S-TAB' to see hits, `C-next' to visit them, etc.

Have not used that package, hear lots of good things about it!

Question: that line-ending plus-sign -- part of the command-string,
or some kind of continuation char?  (Oh no, obvious: the "one or more" regexp-char!)

For those (few? many?) of us who don't know icicles, could you
maybe explain how those two command-strings work, ie sort of translate
each of them into some kind of "emacs-english"?  THANKS!

>http://www.emacswiki.org/emacs/Icicles_-_Search_Commands%2c_Overview
>
>When the set of hits is thus those defined by [^[:ascii:]]+, you can type any
>string using a subset of those chars (i.e., one or more particular non-ascii
>chars) to narrow the hits, then visit any of those, and optionally replace any
>or all of them with alternatives.
>http://www.emacswiki.org/emacs/Icicles_-_Search-And-Replace
>
>

Hmmm.  Maybe a perl program, with hashes, etc, 
I should do it that way?

Seems like overkill, unwieldy too, for something SO COMMONLY
ENCOUNTERED...

David


* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
  2011-01-18 19:54   ` David Combs
@ 2011-01-18 21:32     ` harven
  2011-01-19  4:38     ` Teemu Likonen
  1 sibling, 0 replies; 12+ messages in thread
From: harven @ 2011-01-18 21:32 UTC (permalink / raw)
  To: help-gnu-emacs

dkcombs@panix.com (David Combs) writes:

> FURTHER, and more importantly, how do I *search* for
> one of these funny things, a left-double-quote, say?
> It's so *easy* to just hit C-s "!

You can go to the next non-ascii character using
C-M-s [^[:ascii:]]
Pressing C-s repeatedly while the search is still active steps through the
remaining non-ascii characters; RET ends the search.

> You mean do a query-replace on each non-ascii char?  How do I 
> even know which ones are even *in* some buffer of text?

You can use the following command to list all characters in the buffer together
with their frequencies, most frequent first. The non-ascii ones, being rare,
should appear near the end.

(defun frequency ()
  "Compute the frequency of each character in the buffer.
The result appears in another buffer called *frequency*."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    ;; Keys are one-character strings, hence the `equal' test.
    (let ((freq (make-hash-table :test 'equal)))
      ;; "." matches every character except newline.
      (while (re-search-forward "." nil t)
        (puthash (match-string 0)
                 (1+ (gethash (match-string 0) freq 0))
                 freq))
      (pop-to-buffer "*frequency*")
      (erase-buffer)
      (maphash
       (lambda (key value)
         (insert key "  " (number-to-string value) "\n"))
       freq))
    ;; Sort on the last field (the count), then put the largest first.
    (sort-numeric-fields -1 (point-min) (point-max))
    (reverse-region (point-min) (point-max))
    (other-window 1)))
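
If only the non-ascii characters are of interest, the same counting idea can be
restricted to them.  The following is a sketch under that assumption; the names
`my-non-ascii-census' and `my-show-non-ascii-census' and the buffer name
*non-ascii* are invented here.  Keys are character codes rather than strings,
so the code point can be printed next to each character.

```elisp
(defun my-non-ascii-census ()
  "Return an alist of (CHAR . COUNT) for non-ASCII characters in the buffer.
The most frequent character comes first."
  (let ((counts (make-hash-table :test 'eql))
        result)
    (save-excursion
      (goto-char (point-min))
      ;; Each match is exactly one character, now just before point.
      (while (re-search-forward "[^[:ascii:]]" nil t)
        (let ((ch (char-before)))
          (puthash ch (1+ (gethash ch counts 0)) counts))))
    (maphash (lambda (ch n) (push (cons ch n) result)) counts)
    (sort result (lambda (a b) (> (cdr a) (cdr b))))))

(defun my-show-non-ascii-census ()
  "Display each non-ASCII character with its code point and count."
  (interactive)
  (let ((census (my-non-ascii-census)))
    (with-output-to-temp-buffer "*non-ascii*"
      (dolist (entry census)
        (princ (format "%c  #x%04x  %d\n"
                       (car entry) (car entry) (cdr entry)))))))
```

The resulting list tells you which characters need an entry in whatever
translation table you end up writing.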

>
> What'd be nice is something that went through the whole
> buffer *once*, doing the "right thing" with each
> non-ascii char.
>
> Do I make any sense?  Or do I not really understand?

Yes it makes sense.

Have a look at iso-cvt.el. This package provides commands to handle iso8859-1
characters. It contains a function called iso-translate-conventions.
This function translates characters according to a translation table. I am not
aware of a table giving an ascii translation for all utf-8 characters, so you
will have to make your own, along the lines of

(defvar my-iso-trans-tab
  '(("à" "a")
    ("é" "e")
    ("ß" "s")
    ("ñ" "~n"))
  "Translation table for translating some character to ascii.
   This table is not exhaustive.")

Then, assuming you have loaded iso-cvt.el (which defines
iso-translate-conventions), use the following command to translate the
selected region.

(defun my-iso-all2ascii (from to &optional buffer)
 "Translate to ascii characters.
Translate the region between FROM and TO using the table
`my-iso-trans-tab'.
Optional arg BUFFER is ignored (for use in `format-alist')."
 (interactive "*r")
 (iso-translate-conventions from to my-iso-trans-tab))
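
For the ms-word characters that started this thread (curly quotes, dashes and
the like), a single pass without iso-cvt.el is also possible.  What follows is
only a sketch: the table `my-ascii-equivalents' and the command
`my-asciify-region' are hypothetical names, and the table covers just a few
common characters, so extend it to taste.

```elisp
(defvar my-ascii-equivalents
  '((#x2018 . "'")    ; left single quotation mark
    (#x2019 . "'")    ; right single quotation mark
    (#x201c . "\"")   ; left double quotation mark
    (#x201d . "\"")   ; right double quotation mark
    (#x2013 . "-")    ; en dash
    (#x2014 . "--")   ; em dash
    (#x2026 . "...")  ; horizontal ellipsis
    (#x00a0 . " "))   ; no-break space
  "Hypothetical alist mapping non-ASCII characters to ASCII strings.")

(defun my-asciify-region (beg end)
  "Replace characters listed in `my-ascii-equivalents' between BEG and END.
Return the number of non-ASCII characters left unchanged."
  (interactive "*r")
  (let ((unknown 0))
    (save-excursion
      (save-restriction
        (narrow-to-region beg end)
        (goto-char (point-min))
        ;; Visit every non-ASCII character exactly once.
        (while (re-search-forward "[^[:ascii:]]" nil t)
          (let ((ascii (cdr (assq (char-before) my-ascii-equivalents))))
            (if ascii
                (replace-match ascii t t)
              ;; No entry in the table: leave the character alone.
              (setq unknown (1+ unknown)))))))
    unknown))
```

Select the text (or `C-x h' for the whole buffer) and run
`M-x my-asciify-region'.  Characters without a table entry are left in place,
and the return value says how many of those remain.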

Hope that helps


* RE: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
  2011-01-18 20:19         ` David Combs
@ 2011-01-19  2:02           ` Drew Adams
  0 siblings, 0 replies; 12+ messages in thread
From: Drew Adams @ 2011-01-19  2:02 UTC (permalink / raw)
  To: 'David Combs', help-gnu-emacs

> >If you use Icicles, you can also see and search for all
> >sequences of non-ascii chars this way:  C-` [^[:ascii:]]+
> >`S-TAB' to see hits, `C-next' to visit them, etc.
> 
> Have not used that package, hear lots of good things about it!
> 
> Question: that line-ending plus-sign -- part of the command-string,
> or some kind of continuation char?  (Oh no, obvious: the "one 
> or more" regexp-char!)

Yes, the + just means one or more.  It applies to the character set [^[:ascii:]],
which matches any character except (`^') those in the character class
[:ascii:], i.e. any non-ascii character.  The latter information
(character classes) is still missing from the Emacs manual, but you will find it
in the Elisp manual, nodes `Regexp Special' and `Char Classes'.

> For those (few? many?) of us who don't know icicles, could you
> maybe explain how those two command-strings work, ie sort of translate
> each of them into some kind of "emacs-english"?  THANKS!

I guess you're asking about `C-`' and `S-TAB'.  In Icicle mode, `C-`' is bound
to `icicle-search' by default, and `S-TAB' does regexp completion.

C-` [^[:ascii:]]+ parses the buffer into search contexts (the regions that match
[^[:ascii:]]+), and it reads your input with completion, making those contexts
available as the set of completion candidates.  IOW, you use completion to
choose which search hits to visit.

When you hit S-TAB it completes your minibuffer input (empty so far) against the
candidate search contexts, showing those that match your input in buffer
*Completions*.  Since your input is empty they all match and are all shown.

If you can type non-ascii chars (or paste them into the minibuffer), then doing
that filters the candidates to those that match the sequence of chars you
inserted.  For example, if you type a non-ascii double-quote, then only the
contexts that contain that char are now the candidates.  Change your minibuffer
input and you change the set of matching contexts, which you can visit.

Whatever the current set of matching candidates is, you can visit any of their
locations by cycling among them using `next' (PageDown) and `C-RET' to choose.
Or just visit some in sequence (buffer order, by default) using `C-next'.  Or
just visit some by clicking `C-mouse-2' on them in *Completions*.  `RET' (or
`C-g') ends the tour.

You can also perform replacements of either an entire search context or just the
part(s) that your input matches.  You could, for example, replace all of the
non-ascii double-quote chars by an ascii double-quote.

> >http://www.emacswiki.org/emacs/Icicles_-_Search_Commands%2c_Overview
> >
> >When the set of hits is thus those defined by [^[:ascii:]]+, 
> >you can type any string using a subset of those chars (i.e.,
> >one or more particular non-ascii chars) to narrow the hits,
> >then visit any of those, and optionally replace any
> >or all of them with alternatives.
> >http://www.emacswiki.org/emacs/Icicles_-_Search-And-Replace
> 
> Hmmm.  Maybe a perl program, with hashes, etc, I should do it
> that way? Seems like overkill, unwieldy too, for something SO
> COMMONLY ENCOUNTERED...

As I also mentioned, and someone else did the same later, you can also use
[^[:ascii:]] with incremental regexp search: `C-M-s'.


* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
  2011-01-18 19:54   ` David Combs
  2011-01-18 21:32     ` harven
@ 2011-01-19  4:38     ` Teemu Likonen
  2011-01-20  6:57       ` Kevin Rodgers
  1 sibling, 1 reply; 12+ messages in thread
From: Teemu Likonen @ 2011-01-19  4:38 UTC (permalink / raw)
  To: David Combs; +Cc: help-gnu-emacs

* 2011-01-18 19:54 (UTC), David Combs wrote:

>>> What I'd like to do is change those "strange" characters to their
>>> plain-ascii "equivalent", so to speak. Like '"' for double quote
>>> (left OR right), etc.

> What'd be nice is something that went through the whole buffer *once*,
> doing the "right thing" with each non-ascii char.

I don't know if Linux system's "iconv" utility is available in other
operating systems but at least GNU/Linux users could do it with this:

    C-u M-x shell-command-on-region RET iconv -t ascii//translit RET

with the region being the whole buffer. RET means <Enter> key, in Emacs
speak.
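
The same filter can be wrapped in a command.  This is a sketch only: the name
`my-iconv-translit-buffer' is invented, it assumes an "iconv" program that
understands //TRANSLIT is on your PATH, and it adds an explicit "-f UTF-8"
(not in the command line above) so that the result does not depend on the
locale.

```elisp
(defun my-iconv-translit-buffer ()
  "Filter the whole buffer through iconv with ASCII//TRANSLIT.
Return the exit status of iconv."
  (interactive "*")
  ;; Send the text to iconv as UTF-8 and decode its output the same way.
  (let ((coding-system-for-write 'utf-8)
        (coding-system-for-read 'utf-8))
    ;; DELETE = t replaces the region with the output;
    ;; '(t nil) keeps stdout in this buffer and discards stderr.
    (call-process-region (point-min) (point-max) "iconv"
                         t '(t nil) nil
                         "-f" "UTF-8" "-t" "ASCII//TRANSLIT")))
```

A non-zero return value means iconv complained; in that case `C-/' (undo)
brings the original text back.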




* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
  2011-01-19  4:38     ` Teemu Likonen
@ 2011-01-20  6:57       ` Kevin Rodgers
  0 siblings, 0 replies; 12+ messages in thread
From: Kevin Rodgers @ 2011-01-20  6:57 UTC (permalink / raw)
  To: help-gnu-emacs

On 1/18/11 9:38 PM, Teemu Likonen wrote:
>      C-u M-x shell-command-on-region RET iconv -t ascii//translit RET
>
> with the region being the whole buffer.

`C-x h' before `C-u M-x shell-command-on-region ...'

-- 
Kevin Rodgers
Denver, Colorado, USA



