* character encoding
From: Hugh Lawson @ 2002-10-20 19:01 UTC

Sometimes when I cut and paste "it's" from a web page into an emacs
buffer it transfers as "it?s". Ditto for other similar events.

How do I fix this? What do I need to study?

--
Hugh Lawson
hlawson@triad.rr.com

* Re: character encoding
From: Charles Muller @ 2002-10-21 7:36 UTC
Cc: help-gnu-emacs

Hugh wrote:
>
> Sometimes when I cut and paste "it's" from a web page into an emacs
> buffer it transfers as "it?s". Ditto for other similar events.

Most likely these are "curly" apostrophes, which get inserted when
people publish HTML by first writing it in a word processor like Word
or WordPerfect. Those programs use curly apostrophes and quotation
marks by default, rather than standard, straight apostrophes. I suspect
that if you copy this text into any other plain text editor (not only
Emacs) the same thing will happen.

I am working on a couple of different web projects where people submit
these kinds of word-processor-generated HTML documents, and it is a
real pain in the neck to have to clean them up all the time.

Chuck

---------------------------
Charles Muller <acmuller@gol.com>
Faculty of Humanities, Toyo Gakuen University
Digital Dictionary of Buddhism and CJKV-English Dictionary
[http://www.acmuller.net]
Mobile Phone: 090-9310-1787
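
One plausible reason the apostrophe shows up as "?" (an assumption; the
thread does not confirm the mechanism): the curly apostrophe is U+2019,
which has no slot in ISO-8859-1, so a conversion to that charset has to
substitute a placeholder. A minimal C sketch of that narrowing step
follows. The names to_latin1 and narrow_to_latin1 are hypothetical and
are not Emacs functions.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not from the thread: narrow one Unicode code
   point to a single ISO-8859-1 byte, substituting '?' when the code
   point does not fit in that charset. */
static unsigned char to_latin1(unsigned long cp)
{
    return cp <= 0xFF ? (unsigned char)cp : (unsigned char)'?';
}

/* Narrow n code points from src into dst and NUL-terminate the result.
   dst must have room for n + 1 bytes. */
static void narrow_to_latin1(const unsigned long *src, size_t n, char *dst)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = (char)to_latin1(src[i]);
    dst[n] = '\0';
}
```

Passing the four code points 'i', 't', 0x2019, 's' to narrow_to_latin1
yields the string "it?s", which matches what Hugh sees in his buffer.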

* Re: character encoding
From: Ivan Kanis @ 2002-10-26 17:34 UTC

    Charles> Hugh wrote:
    >>
    >> Sometimes when I cut and paste "it's" from a web page into an
    >> emacs buffer it transfers as "it?s". Ditto for other similar
    >> events.

    Charles> Most likely these are "curly" apostrophes that are
    Charles> inserted when people publish HTML by first writing it in
    Charles> a word processor like Word or Word Perfect, which use

I agree, it's a real pain. This program strips all of that nonsense. It
won't help with the copy/paste problem, but it works on big chunks of
text. It turns windows-1252 encoding into iso-8859-1 encoding. Basically
one has to convert the crap Microsoft inserted between 0x80 and 0x9f
into something standard.

I know it's in C. If someone cares to turn this into lisp that'll be
neat :)

Ivan

#include <stdio.h>

/* Replacement strings for windows-1252 bytes 0x80..0x9F, indexed by
   (byte - 0x80).  Undefined bytes map to the empty string. */
const char *table[] = {
  "euro",  /* 0x80 0x20AC #EURO SIGN */
  "",      /* 0x81        #UNDEFINED */
  "\"",    /* 0x82 0x201A #SINGLE LOW-9 QUOTATION MARK */
  "f",     /* 0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK */
  "\"",    /* 0x84 0x201E #DOUBLE LOW-9 QUOTATION MARK */
  "...",   /* 0x85 0x2026 #HORIZONTAL ELLIPSIS */
  "*",     /* 0x86 0x2020 #DAGGER */
  "*",     /* 0x87 0x2021 #DOUBLE DAGGER */
  "^",     /* 0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT */
  " 0/00", /* 0x89 0x2030 #PER MILLE SIGN */
  "S",     /* 0x8A 0x0160 #LATIN CAPITAL LETTER S WITH CARON */
  "<",     /* 0x8B 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK */
  "OE",    /* 0x8C 0x0152 #LATIN CAPITAL LIGATURE OE */
  "",      /* 0x8D        #UNDEFINED */
  "Z",     /* 0x8E 0x017D #LATIN CAPITAL LETTER Z WITH CARON */
  "",      /* 0x8F        #UNDEFINED */
  "",      /* 0x90        #UNDEFINED */
  "'",     /* 0x91 0x2018 #LEFT SINGLE QUOTATION MARK */
  "'",     /* 0x92 0x2019 #RIGHT SINGLE QUOTATION MARK */
  "\"",    /* 0x93 0x201C #LEFT DOUBLE QUOTATION MARK */
  "\"",    /* 0x94 0x201D #RIGHT DOUBLE QUOTATION MARK */
  "*",     /* 0x95 0x2022 #BULLET */
  "-",     /* 0x96 0x2013 #EN DASH */
  "-",     /* 0x97 0x2014 #EM DASH */
  "~",     /* 0x98 0x02DC #SMALL TILDE */
  "(TM)",  /* 0x99 0x2122 #TRADE MARK SIGN */
  "s",     /* 0x9A 0x0161 #LATIN SMALL LETTER S WITH CARON */
  ">",     /* 0x9B 0x203A #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */
  "oe",    /* 0x9C 0x0153 #LATIN SMALL LIGATURE OE */
  "",      /* 0x9D        #UNDEFINED */
  "z",     /* 0x9E 0x017E #LATIN SMALL LETTER Z WITH CARON */
  "Y"      /* 0x9F 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS */
};

/* Copy the file named on the command line to stdout, replacing each
   byte in 0x80..0x9F with its string from the table. */
int main (int argc, char **argv) {
  FILE *fd;
  unsigned char in;

  if (argc == 2) {
    /* Binary mode, so no byte is altered or treated as end of file. */
    if ((fd = fopen(argv[1], "rb"))) {
      while (fread(&in, 1, sizeof(char), fd)) {
        if (in >= 0x80 && in < 0xa0) {
          printf("%s", table[in-0x80]);
        } else {
          printf("%c", in);
        }
      }
      fclose(fd);
    }
  }
  return 0;
}

--
/-----------------------------------------------------------------*
| "I shall never make a new friend in my life, | Ivan Kanis       |
| though perhaps a few after I die."           | ivank@juliva.com |
| (Oscar Wilde)                                | www.juliva.com   |
*-----------------------------------------------------------------/
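
The comments in Ivan's table already list the Unicode code point behind
each windows-1252 byte, so a lossless alternative to the ASCII
approximations is to emit those code points as UTF-8. This is only a
sketch of that idea, not something proposed in the thread; the names
cp1252_high and cp1252_byte_to_utf8 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Unicode code points for windows-1252 bytes 0x80..0x9F, copied from
   the comments in Ivan's table.  0 marks the undefined bytes. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Encode one windows-1252 byte as UTF-8 into out, which must have room
   for 3 bytes.  Returns the number of bytes written; bytes that are
   undefined in windows-1252 write nothing and return 0. */
static size_t cp1252_byte_to_utf8(unsigned char in, unsigned char *out)
{
    unsigned int cp = in;   /* outside 0x80..0x9F the byte is the code point */

    assert(out != NULL);
    if (in >= 0x80 && in < 0xA0)
        cp = cp1252_high[in - 0x80];
    if (cp == 0 && in != 0)
        return 0;           /* undefined slot */

    if (cp < 0x80) {        /* one-byte sequence (plain ASCII) */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {       /* two-byte sequence */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* three-byte sequence; every code point in the table is below 0x10000 */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Called in the same read loop as the program above, this would replace
the table lookup and the printf calls with an fwrite of the returned
bytes. The output is then UTF-8 rather than iso-8859-1, so it only makes
sense if whatever reads the result expects UTF-8.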

* Re: character encoding
From: Michael Slass @ 2002-10-26 19:13 UTC

Ivan Kanis <ivank@juliva.com> writes:

>
>I know it's in C. If someone cares to turn this into lisp that'll be neat :)
>
>Ivan
>
>[...]

Ivan:

I can't resist that challenge. Here's a first cut, almost completely
untested, because I don't have any Outlook-born mail to test it on.

(defvar de-microsquish-translation-alist
  '((?\x80 . "euro")
    (?\x81 . "")
    (?\x82 . "\"")
    (?\x83 . "f")
    (?\x84 . "\"")
    (?\x85 . "...")
    (?\x86 . "*")
    (?\x87 . "*")
    (?\x88 . "^")
    (?\x89 . " 0/00")
    (?\x8A . "S")
    (?\x8B . "<")
    (?\x8C . "OE")
    (?\x8D . "")
    (?\x8E . "Z")
    (?\x8F . "")
    (?\x90 . "")
    (?\x91 . "'")
    (?\x92 . "'")
    (?\x93 . "\"")
    (?\x94 . "\"")
    (?\x95 . "*")
    (?\x96 . "-")
    (?\x97 . "-")
    (?\x98 . "~")
    (?\x99 . "(TM)")
    (?\x9A . "s")
    (?\x9B . ">")
    (?\x9C . "oe")
    (?\x9D . "")
    (?\x9E . "z")
    (?\x9F . "Y"))
  "Table of hex values and replacement strings for unprintable Micro$oft chars.
See also `de-microsquish-region'.")

(defun de-microsquish-region (beg end)
  "Translate Micro$oft characters in the region between BEG and END.
Replacements are taken from `de-microsquish-translation-alist'."
  (interactive "r")
  (save-restriction
    (narrow-to-region beg end)
    (goto-char (point-min))
    ;; Walk the region one character at a time.
    (while (not (eobp))
      (let* ((char (char-after))
             (replacement-cell (assoc char de-microsquish-translation-alist))
             (replacement (and replacement-cell (cdr replacement-cell))))
        (if (not replacement)
            ;; Not in the table: leave the character alone.
            (forward-char 1)
          ;; In the table: swap it for its replacement string.
          (delete-char 1)
          (insert replacement))))))

--
Mike Slass

* Re: character encoding
From: Michael Slass @ 2002-10-26 20:05 UTC

Michael Slass <miknrene@drizzle.com> writes:

>Ivan Kanis <ivank@juliva.com> writes:
>
>>
>>I know it's in C. If someone cares to turn this into lisp that'll be neat :)
>>

Responding to my own post. There's also some code in Gnus to do this.
See:

  article-treat-dumbquotes
  article-translate-characters
  article-translate-strings

One of those should work with a modified version of the table I posted
in the previous reply.

--
Mike Slass