From: Kevin Ryde <user42@zip.com.au>
Cc: rms@gnu.org
Subject: Re: gutenberg-coding.el -- coding system for Project Gutenberg files
Date: Tue, 25 Oct 2005 08:33:25 +1000
Message-ID: <87oe5eeea2.fsf@zip.com.au>
In-Reply-To: E1ESdmi-0002fk-Pp@fencepost.gnu.org

[-- Attachment #1: Type: text/plain, Size: 557 bytes --]

This is my attempt at coding system detection for Project Gutenberg
ebooks/etexts, adapted to Emacs CVS.

The charset names in the texts are fairly free-form and need an
unhappy amount of massaging.  "list.log" below is what I grepped out
of all the current files (about 23000 of them).

Some charset names are obvious typos (I have reported them), but it
doesn't hurt to accept them.
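
To give a feel for the massaging, here is a minimal sketch of a few of
the rewrites as a pure string function.  The name
`my-gutenberg-normalize-charset' is hypothetical and not part of the
patch; it covers only four of the rules and leaves out the final lookup
through `locale-charset-to-coding-system'.

```emacs-lisp
;; Hypothetical helper, not part of the patch: a subset of the rewrites
;; as plain string operations, so they can be tried without visiting a
;; file.  Each rule is (REGEXP REPLACEMENT LITERAL-FLAG).
(defun my-gutenberg-normalize-charset (name)
  "Return NAME, a free-form Project Gutenberg charset name, normalized."
  (let ((charset (downcase name)))
    (dolist (rule '(;; "ascii, with a few iso-8859-1 characters" etc
                    ("^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*" "\\1" nil)
                    ;; "cp-1250", "codepage 1250", "windows code page 1252"
                    ("^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
                     "windows-" t)
                    ;; "unicode (utf-8)"
                    ("^unicode (\\(.*\\))$" "\\1" nil)
                    ;; "iso=8859-1", "big 5"
                    ("[= ]" "-" t)))
      (setq charset (replace-regexp-in-string
                     (nth 0 rule) (nth 1 rule) charset t (nth 2 rule))))
    charset))

(my-gutenberg-normalize-charset "ACII, with some ISO-8859-1 characters")
;; => "iso-8859-1"
(my-gutenberg-normalize-charset "Windows Code Page 1252")
;; => "windows-1252"
(my-gutenberg-normalize-charset "Unicode (UTF-8)")
;; => "utf-8"
(my-gutenberg-normalize-charset "Big 5")
;; => "big-5"
```

The rules run in a fixed order, with the blanket "[= ]" rewrite last,
so the earlier patterns still see the spaces they rely on.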

2005-10-24  Kevin Ryde  <user42@zip.com.au>

        * international/mule.el (project-gutenberg-auto-coding-function): New
        function.
        (auto-coding-functions): Add it.
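
For trying the function without patching mule.el, one possible approach
(an assumption, not something the patch itself does) is to register it
from an init file once its definition has been loaded.  `add-to-list'
puts it at the front of `auto-coding-functions', the same position the
patch gives it.  `my-gutenberg-try-sample' is a hypothetical helper for
a quick check on a made-up header.

```emacs-lisp
;; Sketch only: assumes `project-gutenberg-auto-coding-function' from the
;; patch has already been evaluated or loaded.  The `fboundp' guards make
;; this a no-op otherwise.
(when (fboundp 'project-gutenberg-auto-coding-function)
  (add-to-list 'auto-coding-functions
               'project-gutenberg-auto-coding-function))

;; Hypothetical helper: run the detection on a made-up two-line header.
(defun my-gutenberg-try-sample ()
  "Return the coding system detected for a sample header, or nil."
  (when (fboundp 'project-gutenberg-auto-coding-function)
    (with-temp-buffer
      (insert "The Project Gutenberg EBook of Example\r\n"
              "Character set encoding: ISO-8859-1\r\n")
      (goto-char (point-min))
      (project-gutenberg-auto-coding-function (buffer-size)))))
```

Evaluating (my-gutenberg-try-sample) should then return whatever coding
system `locale-charset-to-coding-system' gives for "iso-8859-1", or nil
if the function from the patch is not defined.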


[-- Attachment #2: mule.el.gutenberg.diff --]
[-- Type: text/plain, Size: 5769 bytes --]

Index: mule.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
retrieving revision 1.227
diff -u -c -r1.227 mule.el
cvs server: conflicting specifications of output style
*** mule.el	23 Oct 2005 18:24:00 -0000	1.227
--- mule.el	24 Oct 2005 22:06:19 -0000
***************
*** 1588,1594 ****
  		       (symbol :tag "Coding system"))))
  
  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(sgml-xml-auto-coding-function
  				   sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.
  
--- 1588,1595 ----
  		       (symbol :tag "Coding system"))))
  
  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(project-gutenberg-auto-coding-function
! 				   sgml-xml-auto-coding-function
  				   sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.
  
***************
*** 2204,2209 ****
--- 2205,2307 ----
  
  
  ;;; Built-in auto-coding-functions:
+ 
+ (defun project-gutenberg-auto-coding-function (size)
+   "Determine character encoding of a Project Gutenberg EBook/Etext.
+ This function is designed for use in `auto-coding-functions'.
+ 
+ A Project Gutenberg text has \"Project Gutenberg\" in the first line, and a
+ subsequent \"Character set encoding:\" line.  The latter gives the coding
+ system.
+ 
+ Some early non-ASCII texts don't have a \"Character set encoding:\" line; for
+ those you have to use other Emacs mechanisms (e.g. \\[universal-coding-system-argument]).
+ 
+ See http://www.gutenberg.org for more about Project Gutenberg."
+ 
+   (and (looking-at ".*Project Gutenberg")
+ 
+        ;; The regexp here is "^Cha[rt]acter set encoding: *\\(.*\\)", except
+        ;; tweaked to avoid trailing spaces and \r in the match-string.
+        ;;
+        ;; Project Gutenberg files usually have CRLF line endings, so \r is
+        ;; normal, and trailing spaces have been seen in a few files.
+        ;;
+        ;; "Chatacter" is a typo seen in about 220 files as of 2005 (though
+        ;; only 38 are non-ASCII).
+        ;;
+        (re-search-forward
+         "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
+         ;; only search first 200 lines
+         (save-excursion (forward-line 200) (point))
+         t)
+ 
+        ;; The character set names are slightly free form.  They're perfectly
+        ;; understandable to a human, but need some massaging to get
+        ;; something `locale-charset-to-coding-system' can handle.  The stuff
+        ;; below was tested on the full set of files in 2005.
+        ;;
+        ;; Some readme.txt files give "MP3" or the like as the character
+        ;; set.  That is bogus: it refers to the existence of .mp3 files,
+        ;; and the .txt itself is plain ASCII.  We let such cases get the
+        ;; warning message.
+ 
+        (let* ((orig-charset (match-string 1))
+               (charset      (downcase orig-charset)))
+ 
+          ;; "ascii"                 -> "us-ascii"
+          ;; "iso-646-us (us-ascii)" -> "us-ascii"
+          (if (member charset '("ascii" "iso-646-us (us-ascii)"))
+              (setq charset "us-ascii"))
+ 
+          ;; "ascii, with a few iso-8859-1 characters" etc -> "iso-8859-1"
+          ;; "acii, with some iso-8859-1 characters"       -> "iso-8859-1"
+          ;; the "acii" is a typo in dvptn10.txt, easy enough to allow it
+          (setq charset (replace-regexp-in-string
+                         "^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*"
+                         "\\1" charset t))
+ 
+          ;; "cp-1250"                -> "windows-1250"
+          ;; "cp1251"                 -> "windows-1251"
+          ;; "codepage 1250"          -> "windows-1250"
+          ;; "windows codepage 1252"  -> "windows-1252"
+          ;; "windows code page 1252" -> "windows-1252"
+          (setq charset (replace-regexp-in-string
+                         "^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
+                         "windows-" charset t t))
+ 
+          ;; "unicode" alone -> "utf-8", found in 10752-8.txt
+          (setq charset (replace-regexp-in-string "^unicode\r?$" "utf-8"
+                                                  charset t))
+ 
+          ;; "unicode utf-8" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode utf" "utf"
+                                                  charset t t))
+ 
+          ;; "unicode (utf-8)" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode (\\(.*\\))$" "\\1"
+                                                  charset t))
+ 
+          ;; "iso-8858-1" -> "iso-8859-1", typo in 10439-8.txt
+          (setq charset (replace-regexp-in-string "8858" "8859" charset t t))
+ 
+          ;; "ido-8859-1" -> "iso-8859-1", typo in 10549-8.txt
+          (setq charset (replace-regexp-in-string "^ido-" "iso-" charset t t))
+ 
+          ;; "iso 8859-1 (latin-1)" -> "latin-1"
+          (setq charset (replace-regexp-in-string
+                         "^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$"
+                         "\\2" charset t))
+ 
+          ;; "iso=8859-1" -> "iso-8859-1"
+          ;; "big 5"      -> "big-5"
+          (setq charset (replace-regexp-in-string "[= ]" "-" charset t t))
+ 
+          (or (locale-charset-to-coding-system charset)
+              (progn
+                (message "Warning: unknown coding system \"%s\""
+                         orig-charset)
+                nil)))))
  
  (defun sgml-xml-auto-coding-function (size)
    "Determine whether the buffer is XML, and if so, its encoding.

[-- Attachment #3: list.log --]
[-- Type: text/plain, Size: 5124 bytes --]

Character set encoding:  ASCII                  eg. kimrk12.txt
Character set encoding:  ISO8859_1              eg. c1001107.txt
Character set encoding: ACII, with some ISO-8859-1 characters
                                                eg. dvptn10.txt
Character set encoding: ASCII                   eg. 10001.txt
Character set encoding: ASCII                     
                                                eg. oh11v10.txt
Character set encoding: ASCII, with 2 ISO-8859-1 characters
                                                eg. prpsl10.txt
Character set encoding: ASCII, with a couple of ISO-8859-1 characters
                                                eg. jrcl610.txt
Character set encoding: ASCII, with a few ISO-8859-1 characters
                                                eg. cnnet10.txt
Character set encoding: ASCII (with a few ISO-8859-1 characters)
                                                eg. ltlbh10.txt
Character set encoding: ASCII, with one ISO-8859-1 character
                                                eg. srhrl10.txt
Character set encoding: ASCII, with some ISO-8859-1 characters
                                                eg. bough11.txt
Character set encoding: ASCII, with two ISO-8859-1 characters
                                                eg. prphi10.txt
Character set encoding: Big 5                   eg. dxizi10.txt
Character set encoding: BIG-5                   eg. 8dxzj10.txt
Character set encoding: Big5                    eg. wesik10.txt
Character set encoding: Codepage 1250           eg. sklep10.txt
Character set encoding: CP-1250                 eg. 13083-8.txt
Character set encoding: CP-1251                 eg. 14741-8.txt
Character set encoding: CP-1252                 eg. 12732-8.txt
Character set encoding: CP1251                  eg. 11292-8.txt
Character set encoding: cp1251                  eg. kknta10.txt
Character set encoding: CP1252                  eg. 8ledo10.txt
Character set encoding: EUC-KR                  eg. kedct10.txt
Character set encoding: IDO-8859-1              eg. 10549-8.txt
Character set encoding: ISO-646-US (US-ASCII)   eg. 107.txt
Character set encoding: ISO-8858-1              eg. 10439-8.txt
Character set encoding: ISO 8859-1              eg. 8bld410.txt
Character set encoding: ISO-8859-1              eg. 10002-8.txt
Character set encoding: iso-8859-1              eg. 10429-8.txt
Character set encoding: ISO=8859-1              eg. 7fool10.txt
Character set encoding: ISO 8859-1 (Latin-1)    eg. 8adio10.txt
Character set encoding: iso-8859-15             eg. 8dlrm10.txt
Character set encoding: ISO-8859-2              eg. rnpz810.txt
Character set encoding: ISO Latin-1             eg. 10056-8.txt
Character set encoding: ISO-LATIN-1             eg. 8nggd10.txt
Character set encoding: ISO-Latin-1             eg. hstrd10.txt
Character set encoding: iso-Latin-1             eg. 8wpwl10.txt
Character set encoding: iso-latin-1             eg. 8engl10.txt
Character set encoding: ISO8859-1               eg. 7bjrn10.txt
Character set encoding: ISO8859_1               eg. a1001107.txt
Character set encoding: KOI8-R                  eg. ktria10.txt
Character set encoding: Latin 1                 eg. divrw10.txt
Character set encoding: Latin-1                 eg. 8dawn10.txt
Character set encoding: Latin-4                 eg. kalev10.txt
Character set encoding: Latin1                  eg. 10347-8.txt
Character set encoding: MP3                     eg. 10348-m-readme.txt
Character set encoding: MPEG                    eg. atomi10m-readme.txt
Character set encoding: MPEG Layer 3 (MP3)      eg. 1donq3-readme.txt
Character set encoding: Unicode                 eg. 10752-8.txt
Character set encoding: Unicode (UTF-8)         eg. orama10u.txt
Character set encoding: Unicode UTF-8           eg. 11753-0.txt
Character set encoding: US-ASCII                eg. 10078.txt
Character set encoding: US-ASCII, MIDI, Lilypond, MP3 and TeX
                                                eg. 10535.txt
Character set encoding: UTF-16                  eg. 13083-utf16.txt
Character set encoding: UTF-7                   eg. 8cart10.txt
Character set encoding: UTF-8                   eg. 10140-0.txt
Character set encoding: utf-8                   eg. astrl10.txt
Character set encoding: UTF8                    eg. 8gslt10.txt
Character set encoding: Windows-1250            eg. 15201-8.txt
Character set encoding: Windows 1251            eg. olavg10.txt
Character set encoding: Windows-1252            eg. 8clcn10.txt
Character set encoding: Windows Code Page 1252  eg. 8tjna10.txt
Character set encoding: Windows Codepage 1252   eg. 8vepi10.txt
Character set encoding: Windows1253             eg. orama10.txt
Chatacter set encoding: ISO-8859-1              eg. 10021-8.txt
Chatacter set encoding: iso-8859-1              eg. 10026-8.txt
Chatacter set encoding: MP3                     eg. 10137-m-readme.txt
Chatacter set encoding: Sibelius 3 SIB format and MP3 audio
                                                eg. 10344-readme.txt
Chatacter set encoding: US-ASCII                eg. 10021.txt

