unofficial mirror of help-gnu-emacs@gnu.org
* character encoding
@ 2002-10-20 19:01 Hugh Lawson
  2002-10-21  7:36 ` Charles Muller
       [not found] ` <mailman.1035185597.6743.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 5+ messages in thread
From: Hugh Lawson @ 2002-10-20 19:01 UTC



Sometimes when I cut and paste "it's" from a web page into an Emacs
buffer it transfers as "it?s".  Ditto for other similar characters.

How do I fix this?  What do I need to study?


-- 
Hugh Lawson
hlawson@triad.rr.com


* Re: character encoding
  2002-10-20 19:01 character encoding Hugh Lawson
@ 2002-10-21  7:36 ` Charles Muller
       [not found] ` <mailman.1035185597.6743.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 5+ messages in thread
From: Charles Muller @ 2002-10-21  7:36 UTC
  Cc: help-gnu-emacs

Hugh wrote:

> 
> Sometimes when I cut and paste "it's" from a web page into an Emacs
> buffer it transfers as "it?s".  Ditto for other similar characters.

Most likely these are "curly" apostrophes, which get inserted when people
publish HTML by first writing it in a word processor like Word or
WordPerfect. Those programs use curly apostrophes and quotation marks by
default rather than the standard straight ones. I suspect that if you copy
this text into any other plain text editor (not only Emacs), the same thing
will happen.
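
A quick way to check a saved copy of such a page is to count the bytes that
fall between 0x80 and 0x9F. This is only a rough sketch, and it assumes the
page was saved as single-byte windows-1252 text, which the original question
does not confirm. The function name `count_cp1252_bytes` is made up for
illustration.

```c
#include <stddef.h>

/* Count the bytes in buf that fall in the range 0x80..0x9F.
   ISO-8859-1 assigns only control codes to that range, while
   windows-1252 puts printable characters such as curly quotes there,
   so a nonzero count hints at windows-1252 text.
   Caveat: UTF-8 continuation bytes can also land in this range, so
   this is a heuristic for single-byte text only. */
size_t count_cp1252_bytes(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            count++;
    }
    return count;
}
```

For the four bytes 'i', 't', 0x92, 's' (a windows-1252 "it's" with a curly
apostrophe) the count is 1; for the same word with a straight apostrophe it
is 0.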

I am working on a couple of different web projects where people submit these
kinds of word processor generated HTML documents, and it is a real pain in
the neck to have to clean them up all the time.

Chuck

---------------------------
Charles Muller  <acmuller@gol.com>
Faculty of Humanities,  Toyo Gakuen University
Digital Dictionary of Buddhism and CJKV-English Dictionary 
[http://www.acmuller.net]
Mobile Phone: 090-9310-1787


* Re: character encoding
       [not found] ` <mailman.1035185597.6743.help-gnu-emacs@gnu.org>
@ 2002-10-26 17:34   ` Ivan Kanis
  2002-10-26 19:13     ` Michael Slass
  0 siblings, 1 reply; 5+ messages in thread
From: Ivan Kanis @ 2002-10-26 17:34 UTC



    Charles> Hugh wrote:
    >>
    >> Sometimes when I cut and paste "it's" from a web page into an
    >> Emacs buffer it transfers as "it?s".  Ditto for other similar
    >> characters.

    Charles> Most likely these are "curly" apostrophes that are
    Charles> inserted when people publish HTML by first writing it in
    Charles> a word processor like Word or Word Perfect, which use

I agree it's a real pain. This program strips all of that nonsense. It
won't help with the copy/paste problem, but it will work on a big chunk
of text saved to a file. It turns windows-1252 text into plain
iso-8859-1. Basically one has to convert the crap Microsoft inserted
between 0x80 and 0x9f into something standard; here each of those bytes
is replaced by an ASCII approximation.

I know it's in C. If someone cares to turn this into lisp that'll be neat :)

Ivan


#include <stdio.h>

/* ASCII stand-ins for windows-1252 bytes 0x80..0x9f, indexed by byte - 0x80 */
const char *table [] =  {
    "euro",  /* 0x80 0x20AC  #EURO SIGN */
    "",      /* 0x81          #UNDEFINED */
    "'",     /* 0x82  0x201A  #SINGLE LOW-9 QUOTATION MARK */
    "f",     /* 0x83  0x0192  #LATIN SMALL LETTER F WITH HOOK */
    "\"",    /* 0x84  0x201E  #DOUBLE LOW-9 QUOTATION MARK */
    "...",   /* 0x85  0x2026  #HORIZONTAL ELLIPSIS */
    "*",     /* 0x86  0x2020  #DAGGER */
    "*",     /* 0x87  0x2021  #DOUBLE DAGGER */
    "^",     /* 0x88  0x02C6  #MODIFIER LETTER CIRCUMFLEX ACCENT */
    " 0/00", /* 0x89  0x2030  #PER MILLE SIGN */
    "S",     /* 0x8A  0x0160  #LATIN CAPITAL LETTER S WITH CARON */
    "<",     /* 0x8B  0x2039  #SINGLE LEFT-POINTING ANGLE QUOTATION MARK */
    "OE",    /* 0x8C  0x0152  #LATIN CAPITAL LIGATURE OE */
    "",      /* 0x8D          #UNDEFINED */
    "Z",     /* 0x8E  0x017D  #LATIN CAPITAL LETTER Z WITH CARON */
    "",      /* 0x8F          #UNDEFINED */
    "",      /* 0x90          #UNDEFINED */
    "'",     /* 0x91  0x2018  #LEFT SINGLE QUOTATION MARK */
    "'",     /* 0x92  0x2019  #RIGHT SINGLE QUOTATION MARK */
    "\"",    /* 0x93  0x201C  #LEFT DOUBLE QUOTATION MARK */
    "\"",    /* 0x94  0x201D  #RIGHT DOUBLE QUOTATION MARK */
    "*",     /* 0x95  0x2022  #BULLET */
    "-",     /* 0x96  0x2013  #EN DASH */
    "-",     /* 0x97  0x2014  #EM DASH */
    "~",     /* 0x98  0x02DC  #SMALL TILDE */
    "(TM)",  /* 0x99  0x2122  #TRADE MARK SIGN */
    "s",     /* 0x9A  0x0161  #LATIN SMALL LETTER S WITH CARON */
    ">",     /* 0x9B  0x203A  #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */
    "oe",    /* 0x9C  0x0153  #LATIN SMALL LIGATURE OE */
    "",      /* 0x9D          #UNDEFINED */
    "z",     /* 0x9E  0x017E  #LATIN SMALL LETTER Z WITH CARON */
    "Y"      /* 0x9F  0x0178  #LATIN CAPITAL LETTER Y WITH DIAERESIS */
};


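int main (int argc, char **argv) {
    FILE *fd;
    int in;   /* int, not char, so EOF stays distinct from byte 0xff */

    if (argc != 2) {
        fputs("expected exactly one file name argument\n", stderr);
        return 1;
    }
    /* "rb": read raw bytes, with no text-mode translation on Windows */
    if (!(fd = fopen(argv[1], "rb"))) {
        perror(argv[1]);
        return 1;
    }
    while ((in = fgetc(fd)) != EOF) {
        if (in >= 0x80 && in < 0xa0) {
            /* windows-1252 extra: print its ASCII stand-in */
            fputs(table[in - 0x80], stdout);
        } else {
            putchar(in);
        }
    }
    fclose (fd);
    return 0;
}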
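
The comments in the table above also list the Unicode code point for each
byte, so the same lookup could produce UTF-8 instead of ASCII stand-ins. To
be clear, this is a different technique from the program above, and only a
sketch: the names `cp1252_high` and `cp1252_to_utf8` are made up for
illustration, and dropping the five undefined bytes is an assumption, not
something the table prescribes.

```c
#include <stddef.h>

/* Unicode code points for windows-1252 bytes 0x80..0x9F, copied from the
   comments in the table above; 0 marks the five undefined bytes. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Write the UTF-8 encoding of one windows-1252 byte into out, which must
   have room for 3 bytes.  Returns the number of bytes written, or 0 for
   an undefined byte (which is simply dropped). */
size_t cp1252_to_utf8(unsigned char in, unsigned char *out)
{
    unsigned int cp = in;   /* outside 0x80..0x9F the byte is the code point */

    if (in >= 0x80 && in <= 0x9F) {
        cp = cp1252_high[in - 0x80];
        if (cp == 0)
            return 0;
    }
    if (cp < 0x80) {                       /* one-byte form: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two-byte form */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));   /* three-byte form */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Called in the read loop in place of the table lookup, this would keep the
curly quotes as real characters rather than flattening them, at the cost of
needing a UTF-8 capable editor or terminal on the other end.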



-- 
/-----------------------------------------------------------------------------*
|    "I shall never make a new friend in my life,    |       Ivan Kanis       |
|    though perhaps a few after I die."              |    ivank@juliva.com    |
|    (Oscar Wilde)                                   |     www.juliva.com     |
*-----------------------------------------------------------------------------/


* Re: character encoding
  2002-10-26 17:34   ` Ivan Kanis
@ 2002-10-26 19:13     ` Michael Slass
  2002-10-26 20:05       ` Michael Slass
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Slass @ 2002-10-26 19:13 UTC


Ivan Kanis <ivank@juliva.com> writes:

>
>I know it's in C. If someone cares to turn this into lisp that'll be neat :)
>
>Ivan
>
>[snip C program]

Ivan:

I can't resist that challenge.  Here's a first cut, almost completely
untested, because I don't have any Outlook-born mail to test it on.

(defvar de-microsquish-translation-alist
   '(( ?\x80 . "euro" )
     ( ?\x81 . "" )
     ( ?\x82 . "'" )
     ( ?\x83 . "f" )
     ( ?\x84 . "\"" )
     ( ?\x85 . "..." )
     ( ?\x86 . "*" )
     ( ?\x87 . "*" )
     ( ?\x88 . "^" )
     ( ?\x89 . " 0/00" )
     ( ?\x8A . "S" )
     ( ?\x8B . "<" )
     ( ?\x8C . "OE" )
     ( ?\x8D . "" )
     ( ?\x8E . "Z" )
     ( ?\x8F . "" )
     ( ?\x90 . "" )
     ( ?\x91 . "'" )
     ( ?\x92 . "'" )
     ( ?\x93 . "\"" )
     ( ?\x94 . "\"" )
     ( ?\x95 . "*" )
     ( ?\x96 . "-" )
     ( ?\x97 . "-" )
     ( ?\x98 . "~" )
     ( ?\x99 . "(TM)" )
     ( ?\x9A . "s" )
     ( ?\x9B . ">" )
     ( ?\x9C . "oe" )
     ( ?\x9D . "" )
     ( ?\x9E . "z" )
     ( ?\x9F . "Y" ))
"Table of hex values and replacement strings for unprintable Micro$oft chars.
See also `de-microsquish-region'.")


(defun de-microsquish-region (beg end)
  "Translate Micro$oft characters between BEG and END.
Replacements come from `de-microsquish-translation-alist'."
  (interactive "r")
  (save-excursion
    (save-restriction
      (narrow-to-region beg end)
      (goto-char (point-min))
      (while (not (eobp))
        ;; chars are integers, so `assq' is enough for the lookup
        (let ((replacement-cell
               (assq (char-after) de-microsquish-translation-alist)))
          (if (not replacement-cell)
              (forward-char 1)
            ;; swap the char for its replacement; point ends up after it
            (delete-char 1)
            (insert (cdr replacement-cell))))))))


-- 
Mike Slass


* Re: character encoding
  2002-10-26 19:13     ` Michael Slass
@ 2002-10-26 20:05       ` Michael Slass
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Slass @ 2002-10-26 20:05 UTC


Michael Slass <miknrene@drizzle.com> writes:

>Ivan Kanis <ivank@juliva.com> writes:
>
>>
>>I know it's in C. If someone cares to turn this into lisp that'll be neat :)
>>

Responding to my own post.  There's also some code in Gnus to do this.

See:
article-treat-dumbquotes
article-translate-characters
article-translate-strings

One of those should work with a modified version of the table I posted
in the previous reply.

-- 
Mike Slass
