* how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: David Combs @ 2011-01-09  0:53 UTC
To: help-gnu-emacs

When I 'cut-n-paste' from eg ms-word-produced document, into an
emacs buffer (ie ascii), you get all kinds of "non-ascii" chars,
eg left and right double-quotes, like these:

  Char: . (8221, #o20035, #x201d) point=250 of 4096 (6%) column=7
  Char: . (8220, #o20034, #x201c) point=218 of 4096 (5%) column=42

accents, and so on.

When I go to save the buffer, emacs will ask if I want to
save it in eg japanese format.  Not exactly what I want.

What I'd like to do is change those "strange" characters
to their plain-ascii "equivalent", so to speak.  Like
'"' for double quote (left OR right), etc.

Surely I'm not the only one to experience this difficulty:
what work-arounds have YOU found?

Thanks!

David

* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: Eli Zaretskii @ 2011-01-09 14:23 UTC
To: help-gnu-emacs

> From: dkcombs@panix.com (David Combs)
> Newsgroups: gnu.emacs.help
> Date: 8 Jan 2011 19:53:01 -0500
>
> When I 'cut-n-paste' from eg ms-word-produced document, into an
> emacs buffer (ie ascii), you get all kinds of "non-ascii" chars,
> eg left and right double-quotes, like these:
>
>   Char: . (8221, #o20035, #x201d) point=250 of 4096 (6%) column=7
>   Char: . (8220, #o20034, #x201c) point=218 of 4096 (5%) column=42
>
> accents, and so on.
>
> When I go to save the buffer, emacs will ask if I want to
> save it in eg japanese format.  Not exactly what I want.

Doesn't it suggest utf-8 as one of the possible encodings?  If so, why
not use utf-8 and leave these characters in the file?

> What I'd like to do is change those "strange" characters
> to their plain-ascii "equivalent", so to speak.  Like
> '"' for double quote (left OR right), etc.

Not sure why you would want that, but doesn't M-% solve this problem
nicely?  If not, why not?

* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: Kenneth Goldman @ 2011-01-09 16:24 UTC
Cc: help-gnu-emacs

> From: Eli Zaretskii <eliz@gnu.org>
> To: help-gnu-emacs@gnu.org
> Date: 01/09/2011 09:24 AM
> Subject: Re: how to scan file for non-ascii chars (eg cut-n-paste
> from ms-word)
>
> > From: dkcombs@panix.com (David Combs)
> > Newsgroups: gnu.emacs.help
> > Date: 8 Jan 2011 19:53:01 -0500
> >
> > When I 'cut-n-paste' from eg ms-word-produced document, into an
> > emacs buffer (ie ascii), you get all kinds of "non-ascii" chars,
> > eg left and right double-quotes, like these:
> >
> >   Char: . (8221, #o20035, #x201d) point=250 of 4096 (6%) column=7
> >   Char: . (8220, #o20034, #x201c) point=218 of 4096 (5%) column=42
> >
> > accents, and so on.
> >
> > When I go to save the buffer, emacs will ask if I want to
> > save it in eg japanese format.  Not exactly what I want.
>
> Doesn't it suggest utf-8 as one of the possible encodings?  If so, why
> not use utf-8 and leave these characters in the file?

I've seen the same issue.  If I'm writing source code, I want plain
ASCII, nothing unusual that a compiler or linker might complain about.

> > What I'd like to do is change those "strange" characters
> > to their plain-ascii "equivalent", so to speak.  Like
> > '"' for double quote (left OR right), etc.
>
> Not sure why would you want that, but doesn't M-% solve this problem
> nicely?  If not, why not?

query-replace works once one has found the non-ASCII character.  However,
it's often not obvious where the offending text is.

Is there a way to search for anything that isn't ASCII?

* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: Eli Zaretskii @ 2011-01-09 17:30 UTC
To: help-gnu-emacs

> Cc: help-gnu-emacs@gnu.org
> From: Kenneth Goldman <kgoldman@us.ibm.com>
> Date: Sun, 9 Jan 2011 11:24:01 -0500
>
> query-replace works once one has found the non-ASCII character.  However,
> it's often not obvious where the offending text is.

When Emacs asks you to select a suitable encoding, it highlights these
characters, so you can see where they are in the buffer.

> Is there a way to search for anything that isn't ASCII?

If the above is not enough, then try

  M-: (skip-chars-forward "\000-\377") RET

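Editorial sketch (not part of the original message): the same idea can
be wrapped in a small command so it can be repeated with a key binding.
This version assumes the `[:ascii:]' character class instead of an
octal range; skip-chars-forward accepts character classes.  The name
`my-next-non-ascii' is hypothetical.

```elisp
(defun my-next-non-ascii ()
  "Move point to the next non-ASCII character after point, if any.
Return the new value of point.  Hypothetical helper, for illustration."
  (interactive)
  ;; If point already sits on a non-ASCII char, step past it first,
  ;; so that repeated calls advance instead of staying put.
  (when (and (not (eobp)) (> (char-after) 127))
    (forward-char 1))
  ;; Skip over every ASCII character; stop before the first non-ASCII one.
  (skip-chars-forward "[:ascii:]")
  (if (eobp)
      (message "No more non-ASCII characters")
    (message "Non-ASCII char: %c (#x%x)" (char-after) (char-after)))
  (point))
```

After evaluating the definition, `M-x my-next-non-ascii' jumps from one
offending character to the next and reports its code point in the echo
area.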
* RE: how to scan file for non-ascii chars (egcut-n-paste from ms-word)
From: Drew Adams @ 2011-01-09 17:53 UTC
To: 'Eli Zaretskii', help-gnu-emacs

> > query-replace works once one has found the non-ASCII
> > character.  However, it's often not obvious where the
> > offending text is.
>
> When Emacs asks you to select a suitable encoding, it highlights these
> characters, so you can see where they are in the buffer.
>
> > Is there a way to search for anything that isn't ASCII?
>
> If the above is not enough, then try
>   M-: (skip-chars-forward "\000-\377") RET

Better: you can easily search incrementally for non-ascii chars, starting
with Emacs 22, like this: C-M-s [^[:ascii:]]

I just filed an Emacs bug (#7809) to get this info added to the Emacs
manual.  It was added to the Elisp manual (node `Regexp Special') in
Emacs 23, but apparently it wasn't considered useful info for users. ;-)
(Probably just an oversight, actually.)

---

If you use Icicles, you can also see and search for all
sequences of non-ascii chars this way: C-` [^[:ascii:]]+
`S-TAB' to see hits, `C-next' to visit them, etc.
http://www.emacswiki.org/emacs/Icicles_-_Search_Commands%2c_Overview

When the set of hits is thus those defined by [^[:ascii:]]+, you can type any
string using a subset of those chars (i.e., one or more particular non-ascii
chars) to narrow the hits, then visit any of those, and optionally replace any
or all of them with alternatives.
http://www.emacswiki.org/emacs/Icicles_-_Search-And-Replace

* Re: how to scan file for non-ascii chars (egcut-n-paste from ms-word)
From: David Combs @ 2011-01-18 20:19 UTC
To: help-gnu-emacs

In article <mailman.8.1294595713.11727.help-gnu-emacs@gnu.org>,
Drew Adams <drew.adams@oracle.com> wrote:
...
>
>---
>
>If you use Icicles, you can also see and search for all
>sequences of non-ascii chars this way: C-` [^[:ascii:]]+
>`S-TAB' to see hits, `C-next' to visit them, etc.

Have not used that package, hear lots of good things about it!

Question: that line-ending plus-sign -- part of the command-string,
or some kind of continuation char?  (Oh no, obvious: the "one
or more" regexp-char!)

For those (few? many?) of us who don't know icicles, could you
maybe explain how those two command-strings work, ie sort of translate
each of them into some kind of "emacs-english"?  THANKS!

>http://www.emacswiki.org/emacs/Icicles_-_Search_Commands%2c_Overview
>
>When the set of hits is thus those defined by [^[:ascii:]]+, you can type any
>string using a subset of those chars (i.e., one or more particular non-ascii
>chars) to narrow the hits, then visit any of those, and optionally replace any
>or all of them with alternatives.
>http://www.emacswiki.org/emacs/Icicles_-_Search-And-Replace

Hmmm.  Maybe a perl program, with hashes, etc, I should do it
that way?  Seems like overkill, unwieldy too, for something SO
COMMONLY ENCOUNTERED...

David

* RE: how to scan file for non-ascii chars(egcut-n-paste from ms-word)
From: Drew Adams @ 2011-01-19  2:02 UTC
To: 'David Combs', help-gnu-emacs

> >If you use Icicles, you can also see and search for all
> >sequences of non-ascii chars this way: C-` [^[:ascii:]]+
> >`S-TAB' to see hits, `C-next' to visit them, etc.
>
> Have not used that package, hear lots of good things about it!
>
> Question: that line-ending plus-sign -- part of the command-string,
> or some kind of continuation char?  (Oh no, obvious: the "one
> or more" regexp-char!)

Yes, the + just means one or more.  It applies to the character set
[^[:ascii:]], which matches any character except (`^') those in the
character class [:ascii:], that is, any non-ascii character.

The latter information (character classes) is still missing from the
Emacs manual, but you will find it in the Elisp manual, nodes `Regexp
Special' and `Char Classes'.

> For those (few? many?) of us who don't know icicles, could you
> maybe explain how those two command-strings work, ie sort of translate
> each of them into some kind of "emacs-english"?  THANKS!

I guess you're asking about `C-`' and `S-TAB'.  In Icicle mode, `C-`' is
bound to `icicle-search' by default, and `S-TAB' does regexp completion.

C-` [^[:ascii:]]+ parses the buffer into search contexts (the regions
that match [^[:ascii:]]+), and it reads your input with completion,
making those contexts available as the set of completion candidates.
IOW, you use completion to choose which search hits to visit.

When you hit S-TAB it completes your minibuffer input (empty so far)
against the candidate search contexts, showing those that match your
input in buffer *Completions*.  Since your input is empty they all match
and are all shown.

If you can type non-ascii chars (or paste them into the minibuffer),
then doing that filters the candidates to those that match the sequence
of chars you inserted.  For example, if you type a non-ascii
double-quote, then only the contexts that contain that char are now the
candidates.  Change your minibuffer input and you change the set of
matching contexts, which you can visit.

Whatever the current set of matching candidates is, you can visit any of
their locations by cycling among them using `next' (PageDown) and
`C-RET' to choose.  Or just visit some in sequence (buffer order, by
default) using `C-next'.  Or just visit some by clicking `C-mouse-2' on
them in *Completions*.  `RET' (or `C-g') ends the tour.

You can also perform replacements of either an entire search context or
just the part(s) that your input matches.  You could, for example,
replace all of the non-ascii double-quote chars by an ascii double-quote.

> >http://www.emacswiki.org/emacs/Icicles_-_Search_Commands%2c_Overview
> >
> >When the set of hits is thus those defined by [^[:ascii:]]+,
> >you can type any string using a subset of those chars (i.e.,
> >one or more particular non-ascii chars) to narrow the hits,
> >then visit any of those, and optionally replace any
> >or all of them with alternatives.
> >http://www.emacswiki.org/emacs/Icicles_-_Search-And-Replace
>
> Hmmm.  Maybe a perl program, with hashes, etc, I should do it
> that way?  Seems like overkill, unwieldy too, for something SO
> COMMONLY ENCOUNTERED...

As I also mentioned, and someone else did the same later, you can also
use [^[:ascii:]] with incremental regexp search: `C-M-s'.

* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: David Combs @ 2011-01-18 20:06 UTC
To: help-gnu-emacs

In article <mailman.5.1294594249.11727.help-gnu-emacs@gnu.org>,
Eli Zaretskii <eliz@gnu.org> wrote:
>> Cc: help-gnu-emacs@gnu.org
>> From: Kenneth Goldman <kgoldman@us.ibm.com>
>> Date: Sun, 9 Jan 2011 11:24:01 -0500
>>
>> query-replace works once one has found the non-ASCII character.  However,
>> it's often not obvious where the offending text is.
>
>When Emacs asks you to select a suitable encoding, it highlights these
>characters, so you can see where they are in the buffer.
>
>> Is there a way to search for anything that isn't ASCII?
>
>If the above is not enough, then try
>
>  M-: (skip-chars-forward "\000-\377") RET
>

Nice, that skip-chars-forward:

| skip-chars-forward is a built-in function in `C source code'.
| (skip-chars-forward string &optional lim)
|
| Move point forward, stopping before a char not in string, or at pos lim.
| string is like the inside of a `[...]' in a regular expression
| except that `]' is never special and `\' quotes `^', `-' or `\'
| (but not as the end of a range; quoting is never needed there).
| Thus, with arg "a-zA-Z", this skips letters stopping before first nonletter.
| With arg "^a-zA-Z", skips nonletters stopping before first letter.
| Char classes, e.g. `[:alpha:]', are supported.
|
| Returns the distance traveled, either zero or positive.

But you're still down to doing them *one at a time*!

(If there's 200 left-right-quote pairs in a buffer, plus who knows
what else there is, isn't it sort of a pain to use that as a way
to find out what the non-ascii chars *are*?)

David

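Editorial sketch (not part of the original message): one way to find
out which non-ascii chars a buffer contains, in a single pass, is to
search for the regexp [^[:ascii:]] and tally each hit.  The function
name `my-non-ascii-counts' is hypothetical; this is an illustration
under that assumption, not an existing Emacs command.

```elisp
(defun my-non-ascii-counts ()
  "Return an alist of (CHAR . COUNT) for non-ASCII chars in the buffer.
Entries appear in order of first occurrence.  Hypothetical helper."
  (let ((counts nil))
    (save-excursion
      (goto-char (point-min))
      ;; Each successful search leaves point just after one non-ASCII char.
      (while (re-search-forward "[^[:ascii:]]" nil t)
        (let* ((ch (char-before))
               (cell (assq ch counts)))
          (if cell
              (setcdr cell (1+ (cdr cell)))   ; seen before: bump the count
            (push (cons ch 1) counts)))))     ; first sighting
    (nreverse counts)))
```

Evaluating `M-: (my-non-ascii-counts) RET' in the pasted buffer then
shows every distinct offending character code with its number of
occurrences, which tells you what a replacement table needs to cover.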
* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: David Combs @ 2011-01-18 19:54 UTC
To: help-gnu-emacs

In article <mailman.11.1294583034.18702.help-gnu-emacs@gnu.org>,
Eli Zaretskii <eliz@gnu.org> wrote:
>> From: dkcombs@panix.com (David Combs)
>> Newsgroups: gnu.emacs.help
>> Date: 8 Jan 2011 19:53:01 -0500
>>
>> When I 'cut-n-paste' from eg ms-word-produced document, into an
>> emacs buffer (ie ascii), you get all kinds of "non-ascii" chars,
>> eg left and right double-quotes, like these:
>>
>>   Char: . (8221, #o20035, #x201d) point=250 of 4096 (6%) column=7
>>   Char: . (8220, #o20034, #x201c) point=218 of 4096 (5%) column=42
>>
>> accents, and so on.
>>
>> When I go to save the buffer, emacs will ask if I want to
>> save it in eg japanese format.  Not exactly what I want.
>
>Doesn't it suggest utf-8 as one of the possible encodings?  If so, why
>not use utf-8 and leave these characters in the file?

Because (er, as an excuse) I often want to copy-paste them into
an ASCII hints-and-tricks file I keep for my own use, and which
I then edit and search-within via emacs (of course).

Suppose I want to PRINT from that supposedly-ASCII file on my
old (but wonderful) HP-1200 laserjet: all it has for fonts are
the original times, some sans-serif one, something else (I forget),
and "symbol".  Isn't that a problem?

FURTHER, and more importantly, how do I *search* for
one of these funny things, a left-double-quote, say?
It's so *easy* to just hit C-s "!

Given my current state of emacs-knowledge on "foreign" fonts
(like zero), that's what I say -- until I can somehow learn
more.

Thanks!

>
>> What I'd like to do is change those "strange" characters
>> to their plain-ascii "equivalent", so to speak.  Like
>> '"' for double quote (left OR right), etc.
>
>Not sure why would you want that, but doesn't M-% solve this problem
>nicely?  If not, why not?
>

You mean do a query-replace on each non-ascii char?  How do I
even know which ones are even *in* some buffer of text?

What'd be nice is something that went through the whole
buffer *once*, doing the "right thing" with each
non-ascii char.

Do I make any sense?  Or do I not really understand?

Thanks,

David

* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: harven @ 2011-01-18 21:32 UTC
To: help-gnu-emacs

dkcombs@panix.com (David Combs) writes:

> FURTHER, and more importantly, how do I *search* for
> one of these funny things, a left-double-quote, say?
> It's so *easy* to just hit C-s "!

You can go to the next non-ascii character using

  C-M-s [^[:ascii:]] RET

Repeating C-s after that will step through the non-ascii characters.

> You mean do a query-replace on each non-ascii char?  How do I
> even know which ones are even *in* some buffer of text?

You can use the following command to list all characters in the buffer
together with their frequencies.  The non-ascii ones should appear at
the end.

  (defun frequency ()
    "Compute the frequencies for each character in the buffer.
  The result appears in another buffer called *frequency*"
    (interactive)
    (save-excursion
      (goto-char (point-min))
      (let ((freq (make-hash-table :test 'equal)))
        ;; Count every character except newlines ("." does not match them).
        (while (re-search-forward "." nil t)
          (puthash (match-string 0)
                   (1+ (gethash (match-string 0) freq 0))
                   freq))
        (pop-to-buffer "*frequency*")
        (erase-buffer)
        ;; One line per character: the character, a space, its count.
        (maphash (lambda (key value)
                   (insert key " " (number-to-string value) "\n"))
                 freq))
      ;; Sort on the last field (the count), most frequent first.
      (sort-numeric-fields -1 (point-min) (point-max))
      (reverse-region (point-min) (point-max))
      (other-window 1)))

> What'd be nice is something that went through the whole
> buffer *once*, doing the "right thing" with each
> non-ascii char.
>
> Do I make any sense?  Or do I not really understand?

Yes it makes sense.  Have a look at iso-cvt.el.  This package provides
commands to handle iso8859-1 characters.  You can find there a function
called iso-translate-conventions.  This function translates characters
according to a translation table.  I am not aware of a table giving an
ascii translation for all utf-8 characters, so you will have to make
your own, along the lines of

  (defvar my-iso-trans-tab
    '(("à" "a")
      ("é" "e")
      ("ß" "s")
      ("ñ" "~n"))
    "Translation table for translating some character to ascii.
  This table is not exhaustive.")

Then, assuming you have loaded iso-cvt.el (which defines
iso-translate-conventions), use the following command to translate the
selected region.

  (defun my-iso-all2ascii (from to &optional buffer)
    "Translate to ascii characters.
  Translate the region between FROM and TO using the table
  `my-iso-trans-tab'.
  Optional arg BUFFER is ignored (for use in `format-alist')."
    (interactive "*r")
    (iso-translate-conventions from to my-iso-trans-tab))

Hope that helps

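Editorial sketch (not part of the original message): for the
typographic characters that usually come from a word processor, a
similar one-pass replacement can be written without iso-cvt.el.  The
names `my-ascii-replacements' and `my-asciify-region' are hypothetical,
and the table below is an assumption that covers only a handful of
common characters; anything not listed is left untouched.

```elisp
(defvar my-ascii-replacements
  '((#x201c . "\"")   ; left double quotation mark
    (#x201d . "\"")   ; right double quotation mark
    (#x2018 . "'")    ; left single quotation mark
    (#x2019 . "'")    ; right single quotation mark
    (#x2013 . "-")    ; en dash
    (#x2014 . "--")   ; em dash
    (#x2026 . "..."))  ; horizontal ellipsis
  "Alist mapping some non-ASCII chars to ASCII strings.  Not exhaustive.")

(defun my-asciify-region (beg end)
  "Replace chars listed in `my-ascii-replacements' between BEG and END."
  (interactive "*r")
  (save-excursion
    (goto-char beg)
    ;; A marker keeps the region end correct while the text changes length.
    (let ((end-marker (copy-marker end)))
      (while (re-search-forward "[^[:ascii:]]" end-marker t)
        (let ((repl (cdr (assq (char-before) my-ascii-replacements))))
          (when repl
            ;; FIXEDCASE and LITERAL: insert the replacement verbatim.
            (replace-match repl t t))))
      (set-marker end-marker nil))))
```

With the whole buffer selected (`C-x h'), `M-x my-asciify-region' makes
a single pass; running the non-ascii search afterwards shows what the
table still misses.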
* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: Teemu Likonen @ 2011-01-19  4:38 UTC
To: David Combs; +Cc: help-gnu-emacs

* 2011-01-18 19:54 (UTC), David Combs wrote:

>>> What I'd like to do is change those "strange" characters to their
>>> plain-ascii "equivalent", so to speak.  Like '"' for double quote
>>> (left OR right), etc.

> What'd be nice is something that went through the whole buffer *once*,
> doing the "right thing" with each non-ascii char.

I don't know if Linux system's "iconv" utility is available in other
operating systems but at least GNU/Linux users could do it with this:

    C-u M-x shell-command-on-region RET iconv -t ascii//translit RET

with the region being the whole buffer.  RET means <Enter> key, in Emacs
speak.

* Re: how to scan file for non-ascii chars (eg cut-n-paste from ms-word)
From: Kevin Rodgers @ 2011-01-20  6:57 UTC
To: help-gnu-emacs

On 1/18/11 9:38 PM, Teemu Likonen wrote:
>      C-u M-x shell-command-on-region RET iconv -t ascii//translit RET
>
> with the region being the whole buffer.

`C-x h' before `C-u M-x shell-command-on-region ...'

-- 
Kevin Rodgers
Denver, Colorado, USA