* character encoding
@ 2002-10-20 19:01 Hugh Lawson
2002-10-21 7:36 ` Charles Muller
[not found] ` <mailman.1035185597.6743.help-gnu-emacs@gnu.org>
0 siblings, 2 replies; 5+ messages in thread
From: Hugh Lawson @ 2002-10-20 19:01 UTC
Sometimes when I cut and paste "it's" from a web page into an Emacs
buffer, it transfers as "it?s". Ditto for other similar characters.
How do I fix this? What do I need to study?
--
Hugh Lawson
hlawson@triad.rr.com
* Re: character encoding
2002-10-20 19:01 character encoding Hugh Lawson
@ 2002-10-21 7:36 ` Charles Muller
[not found] ` <mailman.1035185597.6743.help-gnu-emacs@gnu.org>
1 sibling, 0 replies; 5+ messages in thread
From: Charles Muller @ 2002-10-21 7:36 UTC
Cc: help-gnu-emacs
Hugh wrote:
>
> Sometimes when I cut and paste "it's" from a web page into an Emacs
> buffer, it transfers as "it?s". Ditto for other similar characters.
Most likely these are "curly" apostrophes that are inserted when people
publish HTML by first writing it in a word processor like Word or
WordPerfect, which use curly apostrophes and quotation marks by default, rather
than standard, straight apostrophes. I suspect that if you copy this text
into any other plain text editor (not only Emacs) the same thing will
happen.
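One plausible way the "?" comes about, sketched in C: the curly apostrophe is U+2019 (RIGHT SINGLE QUOTATION MARK), which has no byte in ISO-8859-1, so a conversion limited to that character set has to substitute something. The helper name `to_latin1_lossy` and the choice of "?" as the substitute are assumptions made for illustration; the thread does not say which conversion actually ran on the poster's system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from Emacs or any library: squeeze
 * Unicode code points into ISO-8859-1. A code point above 0xFF has no
 * Latin-1 byte, so it is replaced with '?'.
 * out must have room for n + 1 bytes. */
void to_latin1_lossy(const unsigned long *cp, size_t n, unsigned char *out)
{
    size_t i;

    assert(cp != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = (cp[i] <= 0xFF) ? (unsigned char)cp[i] : (unsigned char)'?';
    out[n] = '\0';
}
```

Feeding it the four code points 'i', 't', 0x2019, 's' yields the bytes for "it?s", which matches the symptom described above.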
I am working on a couple of different web projects where people submit these
kinds of word-processor-generated HTML documents, and it is a real pain in
the neck to have to clean them up all the time.
Chuck
---------------------------
Charles Muller <acmuller@gol.com>
Faculty of Humanities, Toyo Gakuen University
Digital Dictionary of Buddhism and CJKV-English Dictionary
[http://www.acmuller.net]
Mobile Phone: 090-9310-1787
* Re: character encoding
[not found] ` <mailman.1035185597.6743.help-gnu-emacs@gnu.org>
@ 2002-10-26 17:34 ` Ivan Kanis
2002-10-26 19:13 ` Michael Slass
0 siblings, 1 reply; 5+ messages in thread
From: Ivan Kanis @ 2002-10-26 17:34 UTC
Charles> Hugh wrote:
>>
>> Sometimes when I cut and paste "it's" from a web page into an
>> Emacs buffer, it transfers as "it?s". Ditto for other similar
>> characters.
Charles> Most likely these are "curly" apostrophes that are
Charles> inserted when people publish HTML by first writing it in
Charles> a word processor like Word or WordPerfect, which use
I agree it's a real pain. This program strips all of that nonsense. It
won't help with the copy/paste problem, but it will work on big chunks of
text. It turns windows-1252 text into something that is safe to read as
iso-8859-1: the characters Microsoft put between 0x80 and 0x9f, which
iso-8859-1 reserves for control codes, get replaced with plain ASCII
stand-ins.
I know it's in C. If someone cares to turn this into Lisp, that'll be neat :)
Ivan
#include <stdio.h>

/* ASCII stand-ins for the windows-1252 bytes 0x80..0x9F, indexed by
   (byte - 0x80). Undefined bytes map to the empty string. */
static const char *table[] = {
    "euro",   /* 0x80 0x20AC #EURO SIGN */
    "",       /* 0x81        #UNDEFINED */
    "'",      /* 0x82 0x201A #SINGLE LOW-9 QUOTATION MARK */
    "f",      /* 0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK */
    "\"",     /* 0x84 0x201E #DOUBLE LOW-9 QUOTATION MARK */
    "...",    /* 0x85 0x2026 #HORIZONTAL ELLIPSIS */
    "*",      /* 0x86 0x2020 #DAGGER */
    "*",      /* 0x87 0x2021 #DOUBLE DAGGER */
    "^",      /* 0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT */
    " 0/00",  /* 0x89 0x2030 #PER MILLE SIGN */
    "S",      /* 0x8A 0x0160 #LATIN CAPITAL LETTER S WITH CARON */
    "<",      /* 0x8B 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK */
    "OE",     /* 0x8C 0x0152 #LATIN CAPITAL LIGATURE OE */
    "",       /* 0x8D        #UNDEFINED */
    "Z",      /* 0x8E 0x017D #LATIN CAPITAL LETTER Z WITH CARON */
    "",       /* 0x8F        #UNDEFINED */
    "",       /* 0x90        #UNDEFINED */
    "'",      /* 0x91 0x2018 #LEFT SINGLE QUOTATION MARK */
    "'",      /* 0x92 0x2019 #RIGHT SINGLE QUOTATION MARK */
    "\"",     /* 0x93 0x201C #LEFT DOUBLE QUOTATION MARK */
    "\"",     /* 0x94 0x201D #RIGHT DOUBLE QUOTATION MARK */
    "*",      /* 0x95 0x2022 #BULLET */
    "-",      /* 0x96 0x2013 #EN DASH */
    "-",      /* 0x97 0x2014 #EM DASH */
    "~",      /* 0x98 0x02DC #SMALL TILDE */
    "(TM)",   /* 0x99 0x2122 #TRADE MARK SIGN */
    "s",      /* 0x9A 0x0161 #LATIN SMALL LETTER S WITH CARON */
    ">",      /* 0x9B 0x203A #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */
    "oe",     /* 0x9C 0x0153 #LATIN SMALL LIGATURE OE */
    "",       /* 0x9D        #UNDEFINED */
    "z",      /* 0x9E 0x017E #LATIN SMALL LETTER Z WITH CARON */
    "Y"       /* 0x9F 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS */
};

int main (int argc, char **argv) {
    FILE *fd;
    int in;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    /* "rb": read raw bytes; text mode may alter them on some systems. */
    if (!(fd = fopen(argv[1], "rb"))) {
        perror(argv[1]);
        return 1;
    }
    /* Copy the file to stdout, replacing only bytes 0x80..0x9F. */
    while ((in = fgetc(fd)) != EOF) {
        if (in >= 0x80 && in < 0xa0) {
            fputs(table[in - 0x80], stdout);
        } else {
            putchar(in);
        }
    }
    fclose(fd);
    return 0;
}
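The comments in the table already list the Unicode code point behind each byte, so a variation is possible that keeps the characters instead of flattening them to ASCII. The following is only a sketch of that idea and not part of the program above: it assumes the output should be UTF-8, and the names `cp1252_high` and `cp1252_byte_to_utf8` are made up for illustration. Undefined bytes are dropped, mirroring the empty strings in the table.

```c
#include <assert.h>
#include <stddef.h>

/* Code points for bytes 0x80..0x9F, copied from the table comments.
 * 0 marks the five undefined bytes. */
static const unsigned int cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: encode one windows-1252 byte as UTF-8.
 * Writes at most 3 bytes to out and returns how many were written;
 * returns 0 for an undefined byte, which the caller should skip. */
size_t cp1252_byte_to_utf8(unsigned char in, unsigned char out[3])
{
    unsigned int cp = in;   /* outside 0x80..0x9F the byte is the code point */

    assert(out != NULL);
    if (in >= 0x80 && in <= 0x9F) {
        cp = cp1252_high[in - 0x80];
        if (cp == 0)
            return 0;
    }
    if (cp < 0x80) {                       /* one-byte form */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two-byte form */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));   /* three-byte form */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

In the read loop of the program above, this would stand in for the table lookup: call it for every input byte and write the returned number of bytes to stdout with fwrite.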
--
/-----------------------------------------------------------------------------*
| "I shall never make a new friend in my life, | Ivan Kanis |
| though perhaps a few after I die." | ivank@juliva.com |
| (Oscar Wilde) | www.juliva.com |
*-----------------------------------------------------------------------------/
* Re: character encoding
2002-10-26 17:34 ` Ivan Kanis
@ 2002-10-26 19:13 ` Michael Slass
2002-10-26 20:05 ` Michael Slass
0 siblings, 1 reply; 5+ messages in thread
From: Michael Slass @ 2002-10-26 19:13 UTC
Ivan Kanis <ivank@juliva.com> writes:
>
>I know it's in C. If someone cares to turn this into lisp that'll be neat :)
Ivan:
I can't resist that challenge. Here's a first cut, almost completely
untested, because I don't have any Outlook-born mail to test it on.
;; Keys are the windows-1252 bytes 0x80..0x9F; values are the ASCII
;; stand-ins from the C table.
(defvar de-microsquish-translation-alist
  '((?\x80 . "euro")
    (?\x81 . "")
    (?\x82 . "'")
    (?\x83 . "f")
    (?\x84 . "\"")
    (?\x85 . "...")
    (?\x86 . "*")
    (?\x87 . "*")
    (?\x88 . "^")
    (?\x89 . " 0/00")
    (?\x8A . "S")
    (?\x8B . "<")
    (?\x8C . "OE")
    (?\x8D . "")
    (?\x8E . "Z")
    (?\x8F . "")
    (?\x90 . "")
    (?\x91 . "'")
    (?\x92 . "'")
    (?\x93 . "\"")
    (?\x94 . "\"")
    (?\x95 . "*")
    (?\x96 . "-")
    (?\x97 . "-")
    (?\x98 . "~")
    (?\x99 . "(TM)")
    (?\x9A . "s")
    (?\x9B . ">")
    (?\x9C . "oe")
    (?\x9D . "")
    (?\x9E . "z")
    (?\x9F . "Y"))
  "Table of hex values and replacement strings for unprintable Micro$oft chars.
See also `de-microsquish-region'.")

(defun de-microsquish-region (beg end)
  "Translate Micro$oft characters according to `de-microsquish-translation-alist'."
  (interactive "r")
  (save-restriction
    (narrow-to-region beg end)
    (goto-char (point-min))
    ;; Walk the region one character at a time; replace a character
    ;; only when the alist has an entry for it.
    (while (not (eobp))
      (let* ((char (char-after))
             (replacement-cell (assoc char de-microsquish-translation-alist))
             (replacement (and replacement-cell (cdr replacement-cell))))
        (if (not replacement)
            (forward-char 1)
          (delete-char 1)
          (insert replacement))))))
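Since the function above is described as almost completely untested for lack of sample mail, a small generator for test input might help. This is a sketch only: the helper name `fill_cp1252_sample` and the one-byte-per-line layout are assumptions, chosen so that each replacement is easy to spot.

```c
#include <assert.h>
#include <stddef.h>

#define CP1252_SAMPLE_LEN 64   /* 32 bytes, each followed by a newline */

/* Hypothetical helper: fill out with the bytes 0x80..0x9F, one per
 * line. Returns the number of bytes written, or 0 if cap is too small. */
size_t fill_cp1252_sample(unsigned char *out, size_t cap)
{
    size_t n = 0;
    unsigned int b;

    assert(out != NULL);
    if (cap < CP1252_SAMPLE_LEN)
        return 0;
    for (b = 0x80; b <= 0x9F; b++) {
        out[n++] = (unsigned char)b;
        out[n++] = '\n';
    }
    return n;
}
```

Writing the buffer with fwrite to a file opened in "wb" mode gives a file to try the region command on. How Emacs decodes that file when visiting it decides whether the characters in the buffer match the keys of the alist, so results may differ between setups.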
--
Mike Slass
* Re: character encoding
2002-10-26 19:13 ` Michael Slass
@ 2002-10-26 20:05 ` Michael Slass
0 siblings, 0 replies; 5+ messages in thread
From: Michael Slass @ 2002-10-26 20:05 UTC
Michael Slass <miknrene@drizzle.com> writes:
>Ivan Kanis <ivank@juliva.com> writes:
>
>>
>>I know it's in C. If someone cares to turn this into lisp that'll be neat :)
>>
Responding to my own post: there's also some code in Gnus to do this.
See:
  article-treat-dumbquotes
  article-translate-characters
  article-translate-strings
One of those should work with a modified version of the table I posted
in the previous reply.
--
Mike Slass