unofficial mirror of emacs-devel@gnu.org 
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
       [not found]     ` <E1ESdmi-0002fk-Pp@fencepost.gnu.org>
@ 2005-10-24 22:33       ` Kevin Ryde
  2005-10-25  2:11         ` Kenichi Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Kevin Ryde @ 2005-10-24 22:33 UTC
  Cc: rms

[-- Attachment #1: Type: text/plain, Size: 557 bytes --]

This is my go at Project Gutenberg ebook/etext coding system detection
adapted to the emacs cvs.

The charset names in the texts are slightly free-form and need an
unhappy amount of massaging.  "list.log" below is what I grepped out
of all the current files (about 23000 of them).

Some charset names are obvious typos (I reported them), but it doesn't
hurt to allow them.
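
For anyone wanting to try the massaging on strings from list.log
without visiting a file, here is a minimal sketch.  The names
`my-gutenberg-header-charset', `my-gutenberg-charset-rules' and
`my-gutenberg-normalize-charset' are hypothetical and not part of the
patch; the regexps are copied from it, and the final
`locale-charset-to-coding-system' lookup is left out.

```elisp
;; Hypothetical helpers, not part of the patch.  The regexps are copied
;; from it; the `locale-charset-to-coding-system' lookup is left out.

(defun my-gutenberg-header-charset (line)
  "Return the charset text from header LINE, or nil if LINE is no header.
Trailing spaces and carriage returns are kept out of the result."
  (when (string-match
         "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
         line)
    (match-string 1 line)))

(defconst my-gutenberg-charset-rules
  ;; Each entry is (REGEXP REPLACEMENT LITERAL), applied in this order.
  '(("^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*" "\\1" nil)
    ("^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*" "windows-" t)
    ("^unicode$" "utf-8" t)
    ("^unicode utf" "utf" t)
    ("^unicode (\\(.*\\))$" "\\1" nil)
    ("8858" "8859" t)                   ; typo seen in 10439-8.txt
    ("^ido-" "iso-" t)                  ; typo seen in 10549-8.txt
    ("^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$" "\\2" nil)
    ("[= ]" "-" t))
  "Rewrite rules for free-form Project Gutenberg charset names.")

(defun my-gutenberg-normalize-charset (name)
  "Return NAME downcased and massaged by `my-gutenberg-charset-rules'."
  (let ((charset (downcase name)))
    (when (member charset '("ascii" "iso-646-us (us-ascii)"))
      (setq charset "us-ascii"))
    (dolist (rule my-gutenberg-charset-rules)
      (setq charset (replace-regexp-in-string
                     (nth 0 rule) (nth 1 rule) charset t (nth 2 rule))))
    charset))
```

With these, (my-gutenberg-normalize-charset "Windows Code Page 1252")
gives "windows-1252", and bogus names such as "MP3" pass through
unchanged, so the real function would still reach its warning message
for them.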

2005-10-24  Kevin Ryde  <user42@zip.com.au>

        * international/mule.el (project-gutenberg-auto-coding-function): New
        function.
        (auto-coding-functions): Add it.


[-- Attachment #2: mule.el.gutenberg.diff --]
[-- Type: text/plain, Size: 5769 bytes --]

Index: mule.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
retrieving revision 1.227
diff -u -c -r1.227 mule.el
cvs server: conflicting specifications of output style
*** mule.el	23 Oct 2005 18:24:00 -0000	1.227
--- mule.el	24 Oct 2005 22:06:19 -0000
***************
*** 1588,1594 ****
  		       (symbol :tag "Coding system"))))
  
  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(sgml-xml-auto-coding-function
  				   sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.
  
--- 1588,1595 ----
  		       (symbol :tag "Coding system"))))
  
  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(project-gutenberg-auto-coding-function
! 				   sgml-xml-auto-coding-function
  				   sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.
  
***************
*** 2204,2209 ****
--- 2205,2307 ----
  
  
  ;;; Built-in auto-coding-functions:
+ 
+ (defun project-gutenberg-auto-coding-function (size)
+   "Determine character encoding of a Project Gutenberg EBook/Etext.
+ This function is designed for use in `auto-coding-functions'.
+ 
+ A Project Gutenberg text has \"Project Gutenberg\" in the first line, and a
+ subsequent \"Character set encoding:\" line.  The latter gives the coding
+ system.
+ 
+ Some early non-ASCII texts don't have a \"Character set encoding:\", for
+ those you have to use other Emacs mechanisms (eg. \\[universal-coding-system-argument]).
+ 
+ See http://www.gutenberg.org for more about Project Gutenberg."
+ 
+   (and (looking-at ".*Project Gutenberg")
+ 
+        ;; The regexp here is "^Cha[rt]acter set encoding: *\\(.*\\)", except
+        ;; tweaked to avoid trailing spaces and \r in the match-string.
+        ;;
+        ;; Project Gutenberg files are CRLF line endings (usually) so \r is
+        ;; normal; and trailing spaces have been seen in a few files.
+        ;;
+        ;; "Chatacter" is a typo seen in about 220 files as of 2005 (though
+        ;; only 38 are non-ASCII).
+        ;;
+        (re-search-forward
+         "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
+         ;; only search first 200 lines
+         (save-excursion (forward-line 200) (point))
+         t)
+ 
+        ;; The character set names are slightly free form.  They're perfectly
+        ;; understandable to a human, but need some massaging to get
+        ;; something `locale-charset-to-coding-system' can handle.  The stuff
+        ;; below was tested on the full set of files in 2005.
+        ;;
+        ;; Some readme.txt files have "MP3" or the like given as the
+        ;; character set, which is bogus: it refers to the existence of .mp3
+        ;; files, while the .txt itself is plain ascii.  We let such cases
+        ;; get the warning message.
+ 
+        (let* ((orig-charset (match-string 1))
+               (charset      (downcase orig-charset)))
+ 
+          ;; "ascii"                 -> "us-ascii"
+          ;; "iso-646-us (us-ascii)" -> "us-ascii"
+          (if (member charset '("ascii" "iso-646-us (us-ascii)"))
+              (setq charset "us-ascii"))
+ 
+          ;; "ascii, with a few iso-8859-1 characters" etc -> "iso-8859-1"
+          ;; "acii, with some iso-8859-1 characters"       -> "iso-8859-1"
+          ;; the "acii" is a typo in dvptn10.txt, easy enough to allow it
+          (setq charset (replace-regexp-in-string
+                         "^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*"
+                         "\\1" charset t))
+ 
+          ;; "cp-1250"                -> "windows-1250"
+          ;; "cp1251"                 -> "windows-1251"
+          ;; "codepage 1250"          -> "windows-1250"
+          ;; "windows codepage 1252"  -> "windows-1252"
+          ;; "windows code page 1252" -> "windows-1252"
+          (setq charset (replace-regexp-in-string
+                         "^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
+                         "windows-" charset t t))
+ 
+          ;; "unicode" alone -> "utf-8", found in 10752-8.txt
+          (setq charset (replace-regexp-in-string "^unicode\r?$" "utf-8"
+                                                  charset t))
+ 
+          ;; "unicode utf-8" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode utf" "utf"
+                                                  charset t t))
+ 
+          ;; "unicode (utf-8)" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode (\\(.*\\))$" "\\1"
+                                                  charset t))
+ 
+          ;; "iso-8858-1" -> "iso-8859-1", typo in 10439-8.txt
+          (setq charset (replace-regexp-in-string "8858" "8859" charset t t))
+ 
+          ;; "ido-8859-1" -> "iso-8859-1", typo in 10549-8.txt
+          (setq charset (replace-regexp-in-string "^ido-" "iso-" charset t t))
+ 
+          ;; "iso 8859-1 (latin-1)" -> "latin-1"
+          (setq charset (replace-regexp-in-string
+                         "^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$"
+                         "\\2" charset t))
+ 
+          ;; "iso=8859-1" -> "iso-8859-1"
+          ;; "big 5"      -> "big-5"
+          (setq charset (replace-regexp-in-string "[= ]" "-" charset t t))
+ 
+          (or (locale-charset-to-coding-system charset)
+              (progn
+                (message "Warning: unknown coding system \"%s\""
+                         orig-charset)
+                nil)))))
  
  (defun sgml-xml-auto-coding-function (size)
    "Determine whether the buffer is XML, and if so, its encoding.

[-- Attachment #3: list.log --]
[-- Type: text/plain, Size: 5124 bytes --]

Character set encoding:  ASCII                  eg. kimrk12.txt
Character set encoding:  ISO8859_1              eg. c1001107.txt
Character set encoding: ACII, with some ISO-8859-1 characters
                                                eg. dvptn10.txt
Character set encoding: ASCII                   eg. 10001.txt
Character set encoding: ASCII                     
                                                eg. oh11v10.txt
Character set encoding: ASCII, with 2 ISO-8859-1 characters
                                                eg. prpsl10.txt
Character set encoding: ASCII, with a couple of ISO-8859-1 characters
                                                eg. jrcl610.txt
Character set encoding: ASCII, with a few ISO-8859-1 characters
                                                eg. cnnet10.txt
Character set encoding: ASCII (with a few ISO-8859-1 characters)
                                                eg. ltlbh10.txt
Character set encoding: ASCII, with one ISO-8859-1 character
                                                eg. srhrl10.txt
Character set encoding: ASCII, with some ISO-8859-1 characters
                                                eg. bough11.txt
Character set encoding: ASCII, with two ISO-8859-1 characters
                                                eg. prphi10.txt
Character set encoding: Big 5                   eg. dxizi10.txt
Character set encoding: BIG-5                   eg. 8dxzj10.txt
Character set encoding: Big5                    eg. wesik10.txt
Character set encoding: Codepage 1250           eg. sklep10.txt
Character set encoding: CP-1250                 eg. 13083-8.txt
Character set encoding: CP-1251                 eg. 14741-8.txt
Character set encoding: CP-1252                 eg. 12732-8.txt
Character set encoding: CP1251                  eg. 11292-8.txt
Character set encoding: cp1251                  eg. kknta10.txt
Character set encoding: CP1252                  eg. 8ledo10.txt
Character set encoding: EUC-KR                  eg. kedct10.txt
Character set encoding: IDO-8859-1              eg. 10549-8.txt
Character set encoding: ISO-646-US (US-ASCII)   eg. 107.txt
Character set encoding: ISO-8858-1              eg. 10439-8.txt
Character set encoding: ISO 8859-1              eg. 8bld410.txt
Character set encoding: ISO-8859-1              eg. 10002-8.txt
Character set encoding: iso-8859-1              eg. 10429-8.txt
Character set encoding: ISO=8859-1              eg. 7fool10.txt
Character set encoding: ISO 8859-1 (Latin-1)    eg. 8adio10.txt
Character set encoding: iso-8859-15             eg. 8dlrm10.txt
Character set encoding: ISO-8859-2              eg. rnpz810.txt
Character set encoding: ISO Latin-1             eg. 10056-8.txt
Character set encoding: ISO-LATIN-1             eg. 8nggd10.txt
Character set encoding: ISO-Latin-1             eg. hstrd10.txt
Character set encoding: iso-Latin-1             eg. 8wpwl10.txt
Character set encoding: iso-latin-1             eg. 8engl10.txt
Character set encoding: ISO8859-1               eg. 7bjrn10.txt
Character set encoding: ISO8859_1               eg. a1001107.txt
Character set encoding: KOI8-R                  eg. ktria10.txt
Character set encoding: Latin 1                 eg. divrw10.txt
Character set encoding: Latin-1                 eg. 8dawn10.txt
Character set encoding: Latin-4                 eg. kalev10.txt
Character set encoding: Latin1                  eg. 10347-8.txt
Character set encoding: MP3                     eg. 10348-m-readme.txt
Character set encoding: MPEG                    eg. atomi10m-readme.txt
Character set encoding: MPEG Layer 3 (MP3)      eg. 1donq3-readme.txt
Character set encoding: Unicode                 eg. 10752-8.txt
Character set encoding: Unicode (UTF-8)         eg. orama10u.txt
Character set encoding: Unicode UTF-8           eg. 11753-0.txt
Character set encoding: US-ASCII                eg. 10078.txt
Character set encoding: US-ASCII, MIDI, Lilypond, MP3 and TeX
                                                eg. 10535.txt
Character set encoding: UTF-16                  eg. 13083-utf16.txt
Character set encoding: UTF-7                   eg. 8cart10.txt
Character set encoding: UTF-8                   eg. 10140-0.txt
Character set encoding: utf-8                   eg. astrl10.txt
Character set encoding: UTF8                    eg. 8gslt10.txt
Character set encoding: Windows-1250            eg. 15201-8.txt
Character set encoding: Windows 1251            eg. olavg10.txt
Character set encoding: Windows-1252            eg. 8clcn10.txt
Character set encoding: Windows Code Page 1252  eg. 8tjna10.txt
Character set encoding: Windows Codepage 1252   eg. 8vepi10.txt
Character set encoding: Windows1253             eg. orama10.txt
Chatacter set encoding: ISO-8859-1              eg. 10021-8.txt
Chatacter set encoding: iso-8859-1              eg. 10026-8.txt
Chatacter set encoding: MP3                     eg. 10137-m-readme.txt
Chatacter set encoding: Sibelius 3 SIB format and MP3 audio
                                                eg. 10344-readme.txt
Chatacter set encoding: US-ASCII                eg. 10021.txt

[-- Attachment #4: Type: text/plain, Size: 142 bytes --]

_______________________________________________
Emacs-devel mailing list
Emacs-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-devel


* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-24 22:33       ` gutenberg-coding.el -- coding system for Project Gutenberg files Kevin Ryde
@ 2005-10-25  2:11         ` Kenichi Handa
  2005-10-25  3:53           ` Stefan Monnier
  2005-10-25 20:27           ` Richard M. Stallman
  0 siblings, 2 replies; 10+ messages in thread
From: Kenichi Handa @ 2005-10-25  2:11 UTC
  Cc: rms, emacs-devel

In article <87oe5eeea2.fsf@zip.com.au>, Kevin Ryde <user42@zip.com.au> writes:

> [1  <text/plain (7bit)>]
> This is my go at Project Gutenberg ebook/etext coding system detection
> adapted to the emacs cvs.

> The charset names in the texts are slightly free-form and need an
> unhappy amount of massaging.  "list.log" below is what I grepped out
> of all the current files (about 23000 of them).

> Some charset names are obvious typos (I reported them), but it doesn't
> hurt to allow them.

I think the code is good enough to be included in Emacs.  But
it's not a bug fix, and I think not many people would benefit
from it (how many people read Gutenberg text files?).  So
I'd like to ask Richard to decide whether we include it now
or postpone it.

---
Kenichi Handa
handa@m17n.org



* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-25  2:11         ` Kenichi Handa
@ 2005-10-25  3:53           ` Stefan Monnier
  2005-10-26  1:43             ` Kevin Ryde
  2005-10-25 20:27           ` Richard M. Stallman
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2005-10-25  3:53 UTC
  Cc: Kevin Ryde, rms, emacs-devel

>> This is my go at Project Gutenberg ebook/etext coding system detection
>> adapted to the emacs cvs.

Since those files are not used very commonly I think it's difficult to
justify the risk of checking for the presence of a Gutenberg "coding cookie"
on each and every file.
Don't those files have a (set of) standardized extensions?


        Stefan


* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-25  2:11         ` Kenichi Handa
  2005-10-25  3:53           ` Stefan Monnier
@ 2005-10-25 20:27           ` Richard M. Stallman
  1 sibling, 0 replies; 10+ messages in thread
From: Richard M. Stallman @ 2005-10-25 20:27 UTC
  Cc: user42, emacs-devel

    I think the code is good enough to be included in Emacs.  But
    it's not a bug fix, and I think not many people would benefit
    from it (how many people read Gutenberg text files?).  So
    I'd like to ask Richard to decide whether we include it now
    or postpone it.

It fixes something--whether to call it a bug or a gap is not clear to
me.  And it seems harmless.  So there is no reason to wait for
the release.

However, the issue of possible false matches that Stefan raised is
something we should look at first.


* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-25  3:53           ` Stefan Monnier
@ 2005-10-26  1:43             ` Kevin Ryde
  2005-10-27  1:29               ` Richard M. Stallman
  0 siblings, 1 reply; 10+ messages in thread
From: Kevin Ryde @ 2005-10-26  1:43 UTC


Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
> Since those files are not used very commonly I think it's difficult to
> justify the risk of checking for the presence of a Gutenberg "coding cookie"
> on each and every file.

Hopefully the risk is small.  Or put it this way: is there likely to
be a file in the matched form but in some encoding other than the one
stated?

I did want to do something tighter for the first-line match, or locate
the end of the header info part, but the format varies too much.

> Don't those files have a (set of) standardized extensions?

The normal files (the ones I'm interested in) are .txt.  (There are
others like tex or html, but txt is the majority.)

(The downloads are available as .zip containing .txt, and I presume
most people will use that, so something that works from archive-mode
as well as a plain find-file is highly desirable.)


* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-26  1:43             ` Kevin Ryde
@ 2005-10-27  1:29               ` Richard M. Stallman
  2005-10-27  2:04                 ` Kevin Ryde
  2005-10-27 23:57                 ` Kevin Ryde
  0 siblings, 2 replies; 10+ messages in thread
From: Richard M. Stallman @ 2005-10-27  1:29 UTC
  Cc: emacs-devel

    > Don't those files have a (set of) standardized extensions?

    The normal files (the ones I'm interested in) are .txt.  (There are
    others like tex or html, but txt is the majority.)

We could do this only for files called .txt, I suppose.
That would not eliminate false matches, but would limit them.

    I did want to do something tighter for the first-line match, or locate
    the end of the header info part, but the format varies too much.

How about if you send Project Gutenberg a suggestion asking them if
they would make the format more standard, to help programs DTRT?
If you offer one or two specific suggestions, that would be useful.


* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-27  1:29               ` Richard M. Stallman
@ 2005-10-27  2:04                 ` Kevin Ryde
  2005-10-28  3:47                   ` Richard M. Stallman
  2005-10-27 23:57                 ` Kevin Ryde
  1 sibling, 1 reply; 10+ messages in thread
From: Kevin Ryde @ 2005-10-27  2:04 UTC


"Richard M. Stallman" <rms@gnu.org> writes:
>
> How about if you send Project Gutenberg a suggestion asking them if
> they would make the format more standard, to help programs DTRT?
> If you offer one or two specific suggestions, that would be useful.

Yep.  Though with about 16000 texts, and a fairly conservative update
policy (from what I read in the faqs) I don't suppose that can happen
immediately.

Which is to say, I think there's merit in emacs working with what's
presently published/mirrored/etc, if it can be done in an acceptable
way.


* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-27  1:29               ` Richard M. Stallman
  2005-10-27  2:04                 ` Kevin Ryde
@ 2005-10-27 23:57                 ` Kevin Ryde
  1 sibling, 0 replies; 10+ messages in thread
From: Kevin Ryde @ 2005-10-27 23:57 UTC


[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

"Richard M. Stallman" <rms@gnu.org> writes:
>
> We could do this only for files called .txt, I suppose.
> That would not eliminate false matches, but would limit them.

I realized that the only files needing to be matched are the non-ascii
ones with a charset spec.  Doh.  So I think the test can be for a file
starting with one of

	"Project Gutenberg "
	"Project Gutenberg's "
	"The Project Gutenberg "
	"**This is a COPYRIGHTED Project Gutenberg "

and possibly with bytes

	0xEF 0xBB 0xBF

before those, which appear in some (but not all) utf-8 files.  This is
tighter than just "Project Gutenberg" anywhere in the first line.  New
diff below.

I put just "..." to match the three marker bytes.  I'd like to put
those exactly, but it will be matched against a unibyte buffer (if I'm
not mistaken), and I'm unsure how to give a unibyte string literal, or
a multibyte which will match correctly.  (A couple of things I tried
didn't work.)
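
On the unibyte literal question: as far as I know, current Emacs
releases read a string constant whose only non-ASCII content is hex
escapes such as \xEF as a unibyte string of raw bytes, so the marker
could be spelled out exactly.  A minimal sketch under that assumption,
and assuming the auto coding function really does see a unibyte buffer.
The names `my-gutenberg-first-line-regexp' and `my-gutenberg-bom-demo'
and the sample title are hypothetical, and this was not checked against
the 2005 CVS tree.

```elisp
;; Hypothetical demo, not part of the patch.

(defconst my-gutenberg-first-line-regexp
  (concat "\\(\xEF\xBB\xBF\\)?"         ; optional utf-8 marker, raw bytes
          "\\(Project Gutenberg\\('s\\)?"
          "\\|The Project Gutenberg"
          "\\|\\**This is a COPYRIGHTED Project Gutenberg\\) ")
  "Unibyte regexp for the first line of a Project Gutenberg text.")

(defun my-gutenberg-bom-demo ()
  "Return non-nil if the regexp matches a marker-prefixed first line."
  (with-temp-buffer
    (set-buffer-multibyte nil)          ; mimic a unibyte buffer
    ;; Made-up first line preceded by the three marker bytes.
    (insert "\xEF\xBB\xBFThe Project Gutenberg EBook of Example\r\n")
    (goto-char (point-min))
    (looking-at my-gutenberg-first-line-regexp)))
```

If the unibyte assumption does not hold where the function is called,
the "..." form in the diff remains the safer choice, since it matches
any three characters regardless of how the buffer represents them.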



[-- Attachment #2: mule.el.gutenberg-2.diff --]
[-- Type: text/plain, Size: 6207 bytes --]

Index: mule.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
retrieving revision 1.227
diff -u -c -r1.227 mule.el
cvs server: conflicting specifications of output style
*** mule.el	23 Oct 2005 18:24:00 -0000	1.227
--- mule.el	27 Oct 2005 23:56:29 -0000
***************
*** 1588,1594 ****
  		       (symbol :tag "Coding system"))))
  
  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(sgml-xml-auto-coding-function
  				   sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.
  
--- 1588,1595 ----
  		       (symbol :tag "Coding system"))))
  
  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(project-gutenberg-auto-coding-function
! 				   sgml-xml-auto-coding-function
  				   sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.
  
***************
*** 2204,2209 ****
--- 2205,2315 ----
  
  
  ;;; Built-in auto-coding-functions:
+ 
+ (defun project-gutenberg-auto-coding-function (size)
+   "Determine character encoding of a Project Gutenberg EBook/Etext.
+ This function is designed for use in `auto-coding-functions'.
+ 
+ A Project Gutenberg text has \"Project Gutenberg\" in the first line, and a
+ subsequent \"Character set encoding:\" line.  The latter gives the coding
+ system.
+ 
+ Some early non-ASCII texts don't have a \"Character set encoding:\" line; for
+ those you have to use other Emacs mechanisms (e.g. \\[universal-coding-system-argument]).
+ 
+ See http://www.gutenberg.org for more about Project Gutenberg."
+ 
+   ;; This regexp identifies a gutenberg file; it's kept fairly tight to
+   ;; avoid false matches.
+   ;;
+   ;; Many early gutenberg files have different first lines, but the
+   ;; alternatives here are enough for the non-ascii files existing in 2005.
+   ;;
+   ;; Some (but not all) utf-8 files begin with a marker sequence EF BB BF.
+ 
+   (and (looking-at "\\(...\\)?\\(Project Gutenberg\\('s\\)?\\|The Project Gutenberg\\|\\**This is a COPYRIGHTED Project Gutenberg\\) ")
+ 
+        ;; The regexp here is "^Cha[rt]acter set encoding: *\\(.*\\)", except
+        ;; tweaked to avoid trailing spaces and \r in the match-string.
+        ;;
+        ;; Project Gutenberg files (usually) use CRLF line endings, so \r is
+        ;; normal; and trailing spaces have been seen in a few files.
+        ;;
+        ;; "Chatacter" is a typo seen in about 220 files as of 2005 (though
+        ;; only 38 are non-ASCII).
+        ;;
+        (re-search-forward
+         "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
+         ;; only search first 200 lines
+         (save-excursion (forward-line 200) (point))
+         t)
+ 
+        ;; The character set names are slightly free form.  They're perfectly
+        ;; understandable to a human, but need some massaging to get
+        ;; something `locale-charset-to-coding-system' can handle.  The stuff
+        ;; below was tested on the full set of files in 2005.
+        ;;
+        ;; Some readme.txt files have "MP3" or the like given as the
+        ;; character set, which is bogus: it refers to the existence of .mp3
+        ;; files, the .txt is plain ascii.  We let such cases get the warning
+        ;; message.
+ 
+        (let* ((orig-charset (match-string 1))
+               (charset      (downcase orig-charset)))
+ 
+          ;; "ascii"                 -> "us-ascii"
+          ;; "iso-646-us (us-ascii)" -> "us-ascii"
+          (if (member charset '("ascii" "iso-646-us (us-ascii)"))
+              (setq charset "us-ascii"))
+ 
+          ;; "ascii, with a few iso-8859-1 characters" etc -> "iso-8859-1"
+          ;; "acii, with some iso-8859-1 characters"       -> "iso-8859-1"
+          ;; the "acii" is a typo in dvptn10.txt, easy enough to allow it
+          (setq charset (replace-regexp-in-string
+                         "^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*"
+                         "\\1" charset t))
+ 
+          ;; "cp-1250"                -> "windows-1250"
+          ;; "cp1251"                 -> "windows-1251"
+          ;; "codepage 1250"          -> "windows-1250"
+          ;; "windows codepage 1252"  -> "windows-1252"
+          ;; "windows code page 1252" -> "windows-1252"
+          (setq charset (replace-regexp-in-string
+                         "^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
+                         "windows-" charset t t))
+ 
+          ;; "unicode" alone -> "utf-8", found in 10752-8.txt
+          (setq charset (replace-regexp-in-string "^unicode\r?$" "utf-8"
+                                                  charset t))
+ 
+          ;; "unicode utf-8" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode utf" "utf"
+                                                  charset t t))
+ 
+          ;; "unicode (utf-8)" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode (\\(.*\\))$" "\\1"
+                                                  charset t))
+ 
+          ;; "iso-8858-1" -> "iso-8859-1", typo in 10439-8.txt
+          (setq charset (replace-regexp-in-string "8858" "8859" charset t t))
+ 
+          ;; "ido-8859-1" -> "iso-8859-1", typo in 10549-8.txt
+          (setq charset (replace-regexp-in-string "^ido-" "iso-" charset t t))
+ 
+          ;; "iso 8859-1 (latin-1)" -> "latin-1"
+          (setq charset (replace-regexp-in-string
+                         "^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$"
+                         "\\2" charset t))
+ 
+          ;; "iso=8859-1" -> "iso-8859-1"
+          ;; "big 5"      -> "big-5"
+          (setq charset (replace-regexp-in-string "[= ]" "-" charset t t))
+ 
+          (or (locale-charset-to-coding-system charset)
+              (progn
+                (message "Warning: unknown coding system \"%s\""
+                         orig-charset)
+                nil)))))
  
  (defun sgml-xml-auto-coding-function (size)
    "Determine whether the buffer is XML, and if so, its encoding.

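To try the charset massaging outside a file visit, here is a minimal
sketch that applies a subset of the rewrites from the diff to a plain
string.  The function name my-gutenberg-normalize-charset is
hypothetical, the typo rules ("8858", "ido-") are left out, and the
final lookup with `locale-charset-to-coding-system' is omitted so the
result is just the cleaned-up name.

```emacs-lisp
;; Hypothetical helper, not part of the patch: a subset of the diff's
;; rewrites, applied to a string instead of the buffer match.
(defun my-gutenberg-normalize-charset (name)
  "Return NAME, a Project Gutenberg charset string, in a normalized form."
  (let ((charset (downcase name)))
    ;; "ascii" and "iso-646-us (us-ascii)" -> "us-ascii"
    (if (member charset '("ascii" "iso-646-us (us-ascii)"))
        (setq charset "us-ascii"))
    ;; Each rule is (REGEXP REPLACEMENT LITERAL), applied in order.
    (dolist (rule '(("^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*" "\\1" nil)
                    ("^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
                     "windows-" t)
                    ("^unicode$" "utf-8" t)
                    ("^unicode utf" "utf" t)
                    ("^unicode (\\(.*\\))$" "\\1" nil)
                    ("[= ]" "-" t)))
      (setq charset (replace-regexp-in-string
                     (nth 0 rule) (nth 1 rule) charset t (nth 2 rule))))
    charset))

;; Usage, with names of the kind listed in the diff comments:
(my-gutenberg-normalize-charset "ASCII, with a few ISO-8859-1 characters")
(my-gutenberg-normalize-charset "Windows Code Page 1252")
(my-gutenberg-normalize-charset "Unicode (UTF-8)")
(my-gutenberg-normalize-charset "Big 5")
```

These four calls should give "iso-8859-1", "windows-1252", "utf-8" and
"big-5" respectively.  Keeping the rules in a list is only a
convenience for the sketch; the diff applies them one by one.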

* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-27  2:04                 ` Kevin Ryde
@ 2005-10-28  3:47                   ` Richard M. Stallman
  2005-11-10 21:06                     ` Kevin Ryde
  0 siblings, 1 reply; 10+ messages in thread
From: Richard M. Stallman @ 2005-10-28  3:47 UTC (permalink / raw)
  Cc: emacs-devel

    Yep.  Though with about 16000 texts, and a fairly conservative update
    policy (from what I read in the faqs) I don't suppose that can happen
    immediately.

That could well be true.  On the other hand, they would not have to do
the update by hand, file by file.  If they decide to change a format
so as to standardize, they can run a script to do it.


* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-28  3:47                   ` Richard M. Stallman
@ 2005-11-10 21:06                     ` Kevin Ryde
  0 siblings, 0 replies; 10+ messages in thread
From: Kevin Ryde @ 2005-11-10 21:06 UTC (permalink / raw)


I sent a mail to PG suggesting changes to help automated parsing, but I
didn't get a reply.  I think it's worth working with what's available
now, and seeing if something better is possible later.


