From: Ivan Kanis
Newsgroups: gmane.emacs.help
Subject: Re: character encoding
Date: 26 Oct 2002 19:34:35 +0200
Message-ID: <87eladhzo4.fsf@juliva.com>
References: <87smz13p05.fsf@localhost.localdomain>

Charles> 
Hugh wrote:
>>
>> Sometimes when I cut and paste "it's" from a web page into an
>> emacs buffer it transfers as "it?s". Ditto for other similar
>> events.

Charles> Most likely these are "curly" apostrophes that are
Charles> inserted when people publish HTML by first writing it in
Charles> a word processor like Word or Word Perfect, which use

I agree, it's a real pain. The program below strips all of that
nonsense. It won't help with the copy/paste problem, but it will work
on a big chunk of text saved to a file. It turns windows-1252 encoding
into iso-8859-1 encoding. Basically, one has to convert the crap
Microsoft inserted in the range 0x80 to 0x9f into something standard;
here each of those bytes is replaced by a plain ASCII approximation.

I know it's in C. If someone cares to turn this into lisp, that'll be
neat :)

Ivan

#include <stdio.h>

/* ASCII replacements for windows-1252 bytes 0x80..0x9f, indexed by
   (byte - 0x80).  Undefined bytes map to "" and are dropped.  */
const char *table[] = {
  "euro",  /* 0x80 0x20AC #EURO SIGN */
  "",      /* 0x81        #UNDEFINED */
  "'",     /* 0x82 0x201A #SINGLE LOW-9 QUOTATION MARK */
  "f",     /* 0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK */
  "\"",    /* 0x84 0x201E #DOUBLE LOW-9 QUOTATION MARK */
  "...",   /* 0x85 0x2026 #HORIZONTAL ELLIPSIS */
  "*",     /* 0x86 0x2020 #DAGGER */
  "*",     /* 0x87 0x2021 #DOUBLE DAGGER */
  "^",     /* 0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT */
  " 0/00", /* 0x89 0x2030 #PER MILLE SIGN */
  "S",     /* 0x8A 0x0160 #LATIN CAPITAL LETTER S WITH CARON */
  "<",     /* 0x8B 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK */
  "OE",    /* 0x8C 0x0152 #LATIN CAPITAL LIGATURE OE */
  "",      /* 0x8D        #UNDEFINED */
  "Z",     /* 0x8E 0x017D #LATIN CAPITAL LETTER Z WITH CARON */
  "",      /* 0x8F        #UNDEFINED */
  "",      /* 0x90        #UNDEFINED */
  "'",     /* 0x91 0x2018 #LEFT SINGLE QUOTATION MARK */
  "'",     /* 0x92 0x2019 #RIGHT SINGLE QUOTATION MARK */
  "\"",    /* 0x93 0x201C #LEFT DOUBLE QUOTATION MARK */
  "\"",    /* 0x94 0x201D #RIGHT DOUBLE QUOTATION MARK */
  "*",     /* 0x95 0x2022 #BULLET */
  "-",     /* 0x96 0x2013 #EN DASH */
  "-",     /* 0x97 0x2014 #EM DASH */
  "~",     /* 0x98 0x02DC #SMALL TILDE */
  "(TM)",  /* 0x99 0x2122 #TRADE MARK SIGN */
  "s",     /* 0x9A 0x0161 #LATIN SMALL LETTER S WITH CARON */
  ">",     /* 0x9B 0x203A #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */
  "oe",    /* 0x9C 0x0153 #LATIN SMALL LIGATURE OE */
  "",      /* 0x9D        #UNDEFINED */
  "z",     /* 0x9E 0x017E #LATIN SMALL LETTER Z WITH CARON */
  "Y"      /* 0x9F 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS */
};

int main (int argc, char **argv)
{
  FILE *fd;
  unsigned char in;

  if (argc != 2) {
    fprintf (stderr, "usage: %s FILE\n", argv[0]);
    return 1;
  }
  /* Binary mode, so no byte is translated on the way in.  */
  if (!(fd = fopen (argv[1], "rb"))) {
    perror (argv[1]);
    return 1;
  }
  /* Copy the file to stdout one byte at a time, substituting
     the table entry for anything in 0x80..0x9f.  */
  while (fread (&in, sizeof in, 1, fd) == 1) {
    if (in >= 0x80 && in < 0xa0)
      fputs (table[in - 0x80], stdout);
    else
      putchar (in);
  }
  fclose (fd);
  return 0;
}

-- 
/-----------------------------------------------------------------------------*
| "I shall never make a new friend in my life,             | Ivan Kanis       |
| though perhaps a few after I die."                       | ivank@juliva.com |
| (Oscar Wilde)                                            | www.juliva.com   |
*-----------------------------------------------------------------------------/
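
The program above only reads from a file. As a rough sketch of how the same lookup could be applied to text already in memory (a pasted string, say), here is a buffer-to-buffer variant. The function name cp1252_strip and its truncation behaviour are assumptions made for this illustration, not part of the original program; the replacements follow the Unicode names listed in the table's comments.

```c
#include <stddef.h>
#include <string.h>

/* ASCII stand-ins for windows-1252 bytes 0x80..0x9f, indexed by
   (byte - 0x80).  Undefined bytes map to "" and are dropped. */
static const char *const cp1252_fallback[32] = {
    "euro", "",      "'",  "f",  "\"", "...", "*", "*",
    "^",    " 0/00", "S",  "<",  "OE", "",    "Z", "",
    "",     "'",     "'",  "\"", "\"", "*",   "-", "-",
    "~",    "(TM)",  "s",  ">",  "oe", "",    "z", "Y"
};

/* Hypothetical helper: copy the NUL-terminated string src into dst,
   replacing every byte in 0x80..0x9f with its ASCII stand-in.
   At most cap bytes are written, including the terminating NUL.
   Returns the length the complete result needs (without the NUL);
   a return value >= cap means the output was cut short. */
size_t cp1252_strip(const char *src, char *dst, size_t cap)
{
    size_t need = 0;   /* length of the full converted string */
    size_t used = 0;   /* bytes actually stored in dst */
    const unsigned char *p;

    for (p = (const unsigned char *)src; *p != '\0'; p++) {
        char single[2];
        const char *rep;
        size_t n;

        if (*p >= 0x80 && *p < 0xa0) {
            rep = cp1252_fallback[*p - 0x80];
        } else {
            single[0] = (char)*p;
            single[1] = '\0';
            rep = single;
        }
        n = strlen(rep);

        /* Store only while nothing has been dropped yet and the
           replacement plus the final NUL still fits. */
        if (need == used && used + n < cap) {
            memcpy(dst + used, rep, n);
            used += n;
        }
        need += n;
    }
    if (cap > 0)
        dst[used] = '\0';
    return need;
}
```

Called on the string from the original question, with byte 0x92 in place of the apostrophe, cp1252_strip("it\x92s", buf, sizeof buf) leaves "it's" in buf. Returning the needed length rather than the stored length lets a caller detect truncation and retry with a larger buffer.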