* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
       [not found] ` <E1ESdmi-0002fk-Pp@fencepost.gnu.org>
@ 2005-10-24 22:33 ` Kevin Ryde
  2005-10-25  2:11   ` Kenichi Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Kevin Ryde @ 2005-10-24 22:33 UTC
  Cc: rms

[-- Attachment #1: Type: text/plain, Size: 557 bytes --]

This is my go at Project Gutenberg ebook/etext coding system detection
adapted to the emacs cvs.

The charset names in the texts are slightly free-form and need an
unhappy amount of massaging.  "list.log" below is what I grepped out
of all the current files (about 23000 of them).

Some charset names are obvious typos (I reported them), but it doesn't
hurt to allow them.


2005-10-24  Kevin Ryde  <user42@zip.com.au>

	* international/mule.el (project-gutenberg-auto-coding-function): New
	function.
	(auto-coding-functions): Add it.


[-- Attachment #2: mule.el.gutenberg.diff --]
[-- Type: text/plain, Size: 5769 bytes --]

Index: mule.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
retrieving revision 1.227
diff -u -c -r1.227 mule.el
cvs server: conflicting specifications of output style
*** mule.el	23 Oct 2005 18:24:00 -0000	1.227
--- mule.el	24 Oct 2005 22:06:19 -0000
***************
*** 1588,1594 ****
                         (symbol :tag "Coding system"))))

  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(sgml-xml-auto-coding-function
                                     sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.

--- 1588,1595 ----
                         (symbol :tag "Coding system"))))

  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(project-gutenberg-auto-coding-function
!                                    sgml-xml-auto-coding-function
                                     sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.

***************
*** 2204,2209 ****
--- 2205,2307 ----

  ;;; Built-in auto-coding-functions:
+
+ (defun project-gutenberg-auto-coding-function (size)
+   "Determine character encoding of a Project Gutenberg EBook/Etext.
+ This function is designed for use in `auto-coding-functions'.
+
+ A Project Gutenberg text has \"Project Gutenberg\" in the first line, and a
+ subsequent \"Character set encoding:\" line.  The latter gives the coding
+ system.
+
+ Some early non-ASCII texts don't have a \"Character set encoding:\", for
+ those you have to use other Emacs mechanisms (eg. \\[universal-coding-system-argument]).
+
+ See http://www.gutenberg.org for more about Project Gutenberg."
+
+   (and (looking-at ".*Project Gutenberg")
+
+        ;; The regexp here is "^Cha[rt]acter set encoding: *\\(.*\\)", except
+        ;; tweaked to avoid trailing spaces and \r in the match-string.
+        ;;
+        ;; Project Gutenberg files are CRLF line endings (usually) so \r is
+        ;; normal; and trailing spaces have been seen in a few files.
+        ;;
+        ;; "Chatacter" is a typo seen in about 220 files as of 2005 (though
+        ;; only 38 are non-ASCII).
+        ;;
+        (re-search-forward
+         "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
+         ;; only search first 200 lines
+         (save-excursion (forward-line 200) (point))
+         t)
+
+        ;; The character set names are slightly free form.  They're perfectly
+        ;; understandable to a human, but need some massaging to get
+        ;; something `locale-charset-to-coding-system' can handle.  The stuff
+        ;; below was tested on the full set of files in 2005.
+        ;;
+        ;; Some readme.txt files have "MP3" or the like given as the
+        ;; character set, which is bogus, it refers to the existence of .mp3
+        ;; files, the .txt is plain ascii.  We let such cases get the warning
+        ;; message.
+
+        (let* ((orig-charset (match-string 1))
+               (charset (downcase orig-charset)))
+
+          ;; "ascii" -> "us-ascii"
+          ;; "iso-646-us (us-ascii)" -> "us-ascii"
+          (if (member charset '("ascii" "iso-646-us (us-ascii)"))
+              (setq charset "us-ascii"))
+
+          ;; "ascii, with a few iso-8859-1 characters" etc -> "iso-8859-1"
+          ;; "acii, with some iso-8859-1 characters" -> "iso-8859-1"
+          ;; the "acii" is a typo in dvptn10.txt, easy enough to allow it
+          (setq charset (replace-regexp-in-string
+                         "^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*"
+                         "\\1" charset t))
+
+          ;; "cp-1250" -> "windows-1250"
+          ;; "cp1251" -> "windows-1251"
+          ;; "codepage 1250" -> "windows-1250"
+          ;; "windows codepage 1252" -> "windows-1252"
+          ;; "windows code page 1252" -> "windows-1252"
+          (setq charset (replace-regexp-in-string
+                         "^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
+                         "windows-" charset t t))
+
+          ;; "unicode" alone -> "utf-8", found in 10752-8.txt
+          (setq charset (replace-regexp-in-string "^unicode\r?$" "utf-8"
+                                                  charset t))
+
+          ;; "unicode utf-8" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode utf" "utf"
+                                                  charset t t))
+
+          ;; "unicode (utf-8)" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode (\\(.*\\))$" "\\1"
+                                                  charset t))
+
+          ;; "iso-8858-1" -> "iso-8859-1", typo in 10439-8.txt
+          (setq charset (replace-regexp-in-string "8858" "8859" charset t t))
+
+          ;; "ido-8859-1" -> "iso-8859-1", typo in 10549-8.txt
+          (setq charset (replace-regexp-in-string "^ido-" "iso-" charset t t))
+
+          ;; "iso 8859-1 (latin-1)" -> "latin-1"
+          (setq charset (replace-regexp-in-string
+                         "^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$"
+                         "\\2" charset t))
+
+          ;; "iso=8859-1" -> "iso-8859-1"
+          ;; "big 5" -> "big-5"
+          (setq charset (replace-regexp-in-string "[= ]" "-" charset t t))
+
+          (or (locale-charset-to-coding-system charset)
+              (progn
+                (message "Warning: unknown coding system \"%s\""
+                         orig-charset)
+                nil)))))

  (defun sgml-xml-auto-coding-function (size)
    "Determine whether the buffer is XML, and if so, its encoding.

[-- Attachment #3: list.log --]
[-- Type: text/plain, Size: 5124 bytes --]

Character set encoding: ASCII    eg. kimrk12.txt
Character set encoding: ISO8859_1    eg. c1001107.txt
Character set encoding: ACII, with some ISO-8859-1 characters    eg. dvptn10.txt
Character set encoding: ASCII    eg. 10001.txt
Character set encoding: ASCII    eg. oh11v10.txt
Character set encoding: ASCII, with 2 ISO-8859-1 characters    eg. prpsl10.txt
Character set encoding: ASCII, with a couple of ISO-8859-1 characters    eg. jrcl610.txt
Character set encoding: ASCII, with a few ISO-8859-1 characters    eg. cnnet10.txt
Character set encoding: ASCII (with a few ISO-8859-1 characters)    eg. ltlbh10.txt
Character set encoding: ASCII, with one ISO-8859-1 character    eg. srhrl10.txt
Character set encoding: ASCII, with some ISO-8859-1 characters    eg. bough11.txt
Character set encoding: ASCII, with two ISO-8859-1 characters    eg. prphi10.txt
Character set encoding: Big 5    eg. dxizi10.txt
Character set encoding: BIG-5    eg. 8dxzj10.txt
Character set encoding: Big5    eg. wesik10.txt
Character set encoding: Codepage 1250    eg. sklep10.txt
Character set encoding: CP-1250    eg. 13083-8.txt
Character set encoding: CP-1251    eg. 14741-8.txt
Character set encoding: CP-1252    eg. 12732-8.txt
Character set encoding: CP1251    eg. 11292-8.txt
Character set encoding: cp1251    eg. kknta10.txt
Character set encoding: CP1252    eg. 8ledo10.txt
Character set encoding: EUC-KR    eg. kedct10.txt
Character set encoding: IDO-8859-1    eg. 10549-8.txt
Character set encoding: ISO-646-US (US-ASCII)    eg. 107.txt
Character set encoding: ISO-8858-1    eg. 10439-8.txt
Character set encoding: ISO 8859-1    eg. 8bld410.txt
Character set encoding: ISO-8859-1    eg. 10002-8.txt
Character set encoding: iso-8859-1    eg. 10429-8.txt
Character set encoding: ISO=8859-1    eg. 7fool10.txt
Character set encoding: ISO 8859-1 (Latin-1)    eg. 8adio10.txt
Character set encoding: iso-8859-15    eg. 8dlrm10.txt
Character set encoding: ISO-8859-2    eg. rnpz810.txt
Character set encoding: ISO Latin-1    eg. 10056-8.txt
Character set encoding: ISO-LATIN-1    eg. 8nggd10.txt
Character set encoding: ISO-Latin-1    eg. hstrd10.txt
Character set encoding: iso-Latin-1    eg. 8wpwl10.txt
Character set encoding: iso-latin-1    eg. 8engl10.txt
Character set encoding: ISO8859-1    eg. 7bjrn10.txt
Character set encoding: ISO8859_1    eg. a1001107.txt
Character set encoding: KOI8-R    eg. ktria10.txt
Character set encoding: Latin 1    eg. divrw10.txt
Character set encoding: Latin-1    eg. 8dawn10.txt
Character set encoding: Latin-4    eg. kalev10.txt
Character set encoding: Latin1    eg. 10347-8.txt
Character set encoding: MP3    eg. 10348-m-readme.txt
Character set encoding: MPEG    eg. atomi10m-readme.txt
Character set encoding: MPEG Layer 3 (MP3)    eg. 1donq3-readme.txt
Character set encoding: Unicode    eg. 10752-8.txt
Character set encoding: Unicode (UTF-8)    eg. orama10u.txt
Character set encoding: Unicode UTF-8    eg. 11753-0.txt
Character set encoding: US-ASCII    eg. 10078.txt
Character set encoding: US-ASCII, MIDI, Lilypond, MP3 and TeX    eg. 10535.txt
Character set encoding: UTF-16    eg. 13083-utf16.txt
Character set encoding: UTF-7    eg. 8cart10.txt
Character set encoding: UTF-8    eg. 10140-0.txt
Character set encoding: utf-8    eg. astrl10.txt
Character set encoding: UTF8    eg. 8gslt10.txt
Character set encoding: Windows-1250    eg. 15201-8.txt
Character set encoding: Windows 1251    eg. olavg10.txt
Character set encoding: Windows-1252    eg. 8clcn10.txt
Character set encoding: Windows Code Page 1252    eg. 8tjna10.txt
Character set encoding: Windows Codepage 1252    eg. 8vepi10.txt
Character set encoding: Windows1253    eg. orama10.txt
Chatacter set encoding: ISO-8859-1    eg. 10021-8.txt
Chatacter set encoding: iso-8859-1    eg. 10026-8.txt
Chatacter set encoding: MP3    eg. 10137-m-readme.txt
Chatacter set encoding: Sibelius 3 SIB format and MP3 audio    eg. 10344-readme.txt
Chatacter set encoding: US-ASCII    eg. 10021.txt

[-- Attachment #4: Type: text/plain, Size: 142 bytes --]

_______________________________________________
Emacs-devel mailing list
Emacs-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-devel
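
For readers who want to try the header scan and the name massaging outside of mule.el, the following is a minimal sketch. It reuses the regexps from the patch, but the `my-gutenberg-` names, the rule table and the decision to return a string rather than a coding system are assumptions made for illustration; they are not part of the posted code.

```elisp
;; Sketch only: every `my-gutenberg-' name is hypothetical.

(defconst my-gutenberg-charset-rules
  '(("^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*" "\\1" nil)
    ("^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*" "windows-" t)
    ("^unicode\r?$" "utf-8" t)
    ("^unicode utf" "utf" t)
    ("^unicode (\\(.*\\))$" "\\1" nil)
    ("8858" "8859" t)
    ("^ido-" "iso-" t)
    ("^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$" "\\2" nil)
    ("[= ]" "-" t))
  "Rewrite rules (REGEXP REPLACEMENT LITERAL), in the order the patch uses.")

(defun my-gutenberg-normalize-charset (name)
  "Return the header charset NAME rewritten the way the patch rewrites it.
The result is a string; no coding system lookup is attempted."
  (let ((charset (downcase name)))
    (when (member charset '("ascii" "iso-646-us (us-ascii)"))
      (setq charset "us-ascii"))
    ;; Apply each rule in turn; `dolist' returns CHARSET when done.
    (dolist (rule my-gutenberg-charset-rules charset)
      (setq charset (replace-regexp-in-string
                     (nth 0 rule) (nth 1 rule) charset t (nth 2 rule))))))

(defun my-gutenberg-header-charset ()
  "Return the raw charset name found in the current buffer, or nil.
Uses the same regexp and 200-line limit as the patch."
  (save-excursion
    (goto-char (point-min))
    (and (re-search-forward
          "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
          (save-excursion (forward-line 200) (point))
          t)
         (match-string 1))))

;; Usage: a CRLF header with trailing spaces after the charset name.
(with-temp-buffer
  (insert "The Project Gutenberg EBook of Example\r\n\r\n"
          "Character set encoding: Windows Code Page 1252  \r\n")
  (my-gutenberg-normalize-charset (my-gutenberg-header-charset)))
```

Keeping the rules in a table makes it easy to feed every line of list.log through the helper and see which names still fall through to the warning, for example "MP3" or "Windows1253".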
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-24 22:33 ` gutenberg-coding.el -- coding system for Project Gutenberg files Kevin Ryde
@ 2005-10-25  2:11 ` Kenichi Handa
  2005-10-25  3:53   ` Stefan Monnier
  2005-10-25 20:27   ` Richard M. Stallman
  0 siblings, 2 replies; 10+ messages in thread
From: Kenichi Handa @ 2005-10-25  2:11 UTC
  Cc: rms, emacs-devel

In article <87oe5eeea2.fsf@zip.com.au>, Kevin Ryde <user42@zip.com.au> writes:

> This is my go at Project Gutenberg ebook/etext coding system detection
> adapted to the emacs cvs.

> The charset names in the texts are slightly free-form and need an
> unhappy amount of massaging.  "list.log" below is what I grepped out
> of all the current files (about 23000 of them).

> Some charset names are obvious typos (I reported them), but it doesn't
> hurt to allow them.

I think that the code is good to be included in Emacs.  But,
as it's not a bug fix, and I think not many people benefit
from that (how many people read Gutenberg text file?).  So,
I'd like to ask Richard to decide whether we include it now
or postpone it.

---
Kenichi Handa
handa@m17n.org
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-25  2:11 ` Kenichi Handa
@ 2005-10-25  3:53 ` Stefan Monnier
  2005-10-26  1:43   ` Kevin Ryde
  2005-10-25 20:27 ` Richard M. Stallman
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2005-10-25  3:53 UTC
  Cc: Kevin Ryde, rms, emacs-devel

>> This is my go at Project Gutenberg ebook/etext coding system detection
>> adapted to the emacs cvs.

Since those files are not used very commonly I think it's difficult to
justify the risk of checking for the presence of a Gutenberg "coding cookie"
on each and every file.

Don't those files have a (set of) standardized extensions?


        Stefan
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-25  3:53 ` Stefan Monnier
@ 2005-10-26  1:43 ` Kevin Ryde
  2005-10-27  1:29   ` Richard M. Stallman
  0 siblings, 1 reply; 10+ messages in thread
From: Kevin Ryde @ 2005-10-26  1:43 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
> Since those files are not used very commonly I think it's difficult to
> justify the risk of checking for the presence of a Gutenberg "coding cookie"
> on each and every file.

Hopefully the risk is small.  Or put it this way: is there likely to
be a file in the matched form, but which is in some encoding other
than the one stated?

I did want to do something tighter for the first-line match, or locate
the end of the header info part, but the format varies too much.

> Don't those files have a (set of) standardized extensions?

The normal files (the ones I'm interested in) are .txt.  (There's
others like tex or html, txt is the majority.)

(The downloads are available as .zip containing .txt, and I presume
most people will use that, so something that works from archive-mode
as well as a plain find-file is highly desirable.)
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-26  1:43 ` Kevin Ryde
@ 2005-10-27  1:29 ` Richard M. Stallman
  2005-10-27  2:04   ` Kevin Ryde
  2005-10-27 23:57   ` Kevin Ryde
  0 siblings, 2 replies; 10+ messages in thread
From: Richard M. Stallman @ 2005-10-27  1:29 UTC
  Cc: emacs-devel

    > Don't those files have a (set of) standardized extensions?

    The normal files (the ones I'm interested in) are .txt.  (There's
    others like tex or html, txt is the majority.)

We could do this only for files called .txt, I suppose.
That would not eliminate false matches, but would limit them.

    I did want to do something tighter for the first-line match, or locate
    the end of the header info part, but the format varies too much.

How about if you send Project Gutenberg a suggestion asking them if
they would make the format more standard, to help programs DTRT?
If you offer one or two specific suggestions, that would be useful.
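
The idea of limiting detection to files called .txt could be prototyped with a small file-name predicate. This is only a sketch: the name `my-gutenberg-txt-name-p` is hypothetical, and because the function in the patch is called with just SIZE, how a file name would reach it is left open here.

```elisp
;; Sketch only: hypothetical helper, not part of the posted patch.
(defun my-gutenberg-txt-name-p (filename)
  "Return non-nil if FILENAME is a string ending in \".txt\".
The comparison ignores case, so \"FOO.TXT\" also qualifies."
  (let ((case-fold-search t))
    (and (stringp filename)
         ;; \\' anchors the match at the very end of the string.
         (string-match-p "\\.txt\\'" filename))))

;; Usage:
(my-gutenberg-txt-name-p "10002-8.txt")       ; non-nil
(my-gutenberg-txt-name-p "10348-m.mp3")       ; nil
```

As noted in the message above, such a check would not eliminate false matches; it would only narrow the set of files in which they can occur.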
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-27  1:29 ` Richard M. Stallman
@ 2005-10-27  2:04 ` Kevin Ryde
  2005-10-28  3:47   ` Richard M. Stallman
  2005-10-27 23:57 ` Kevin Ryde
  1 sibling, 1 reply; 10+ messages in thread
From: Kevin Ryde @ 2005-10-27  2:04 UTC

"Richard M. Stallman" <rms@gnu.org> writes:
>
> How about if you send Project Gutenberg a suggestion asking them if
> they would make the format more standard, to help programs DTRT?
> If you offer one or two specific suggestions, that would be useful.

Yep.  Though with about 16000 texts, and a fairly conservative update
policy (from what I read in the faqs) I don't suppose that can happen
immediately.

Which is to say, I think there's merit in emacs working with what's
presently published/mirrored/etc, if it can be done in an acceptable
way.
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-27  2:04 ` Kevin Ryde
@ 2005-10-28  3:47 ` Richard M. Stallman
  2005-11-10 21:06   ` Kevin Ryde
  0 siblings, 1 reply; 10+ messages in thread
From: Richard M. Stallman @ 2005-10-28  3:47 UTC
  Cc: emacs-devel

    Yep.  Though with about 16000 texts, and a fairly conservative update
    policy (from what I read in the faqs) I don't suppose that can happen
    immediately.

That could well be true.  On the other hand, they would not have to do
the update by hand, file by file.  If they decide to change a format
so as to standardize, they can run a script to do it.
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-28  3:47 ` Richard M. Stallman
@ 2005-11-10 21:06 ` Kevin Ryde
  0 siblings, 0 replies; 10+ messages in thread
From: Kevin Ryde @ 2005-11-10 21:06 UTC

I sent a mail to PG with suggestions to help automated parsing, but I
didn't get a reply.  I think it's worth working with what's available
now, and seeing if something better is possible later.
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-27  1:29 ` Richard M. Stallman
  2005-10-27  2:04   ` Kevin Ryde
@ 2005-10-27 23:57 ` Kevin Ryde
  1 sibling, 0 replies; 10+ messages in thread
From: Kevin Ryde @ 2005-10-27 23:57 UTC

[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

"Richard M. Stallman" <rms@gnu.org> writes:
>
> We could do this only for files called .txt, I suppose.
> That would not eliminate false matches, but would limit them.

I realized that the only files needing to be matched are the non-ascii
ones with a charset spec.  Doh.  So I think the test can be for a file
starting with one of

    "Project Gutenberg "
    "Project Gutenberg's "
    "The Project Gutenberg "
    "**This is a COPYRIGHTED Project Gutenberg "

and possibly with bytes 0xEF 0xBB 0xBF before those, which appear in
some (but not all) utf-8 files.  This is tighter than just "Project
Gutenberg" anywhere in the first line.

New diff below.  I put just "..." to match the three marker bytes.
I'd like to put those exactly, but it will be matched against a
unibyte buffer (if I'm not mistaken), and I'm unsure how to give a
unibyte string literal, or a multibyte one which will match correctly.
(A couple of things I tried didn't work.)


[-- Attachment #2: mule.el.gutenberg-2.diff --]
[-- Type: text/plain, Size: 6207 bytes --]

Index: mule.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
retrieving revision 1.227
diff -u -c -r1.227 mule.el
cvs server: conflicting specifications of output style
*** mule.el	23 Oct 2005 18:24:00 -0000	1.227
--- mule.el	27 Oct 2005 23:56:29 -0000
***************
*** 1588,1594 ****
                         (symbol :tag "Coding system"))))

  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(sgml-xml-auto-coding-function
                                     sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.

--- 1588,1595 ----
                         (symbol :tag "Coding system"))))

  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(project-gutenberg-auto-coding-function
!                                    sgml-xml-auto-coding-function
                                     sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.

***************
*** 2204,2209 ****
--- 2205,2315 ----

  ;;; Built-in auto-coding-functions:
+
+ (defun project-gutenberg-auto-coding-function (size)
+   "Determine character encoding of a Project Gutenberg EBook/Etext.
+ This function is designed for use in `auto-coding-functions'.
+
+ A Project Gutenberg text has \"Project Gutenberg\" in the first line, and a
+ subsequent \"Character set encoding:\" line.  The latter gives the coding
+ system.
+
+ Some early non-ASCII texts don't have a \"Character set encoding:\", for
+ those you have to use other Emacs mechanisms (eg. \\[universal-coding-system-argument]).
+
+ See http://www.gutenberg.org for more about Project Gutenberg."
+
+   ;; This regexp identifies a gutenberg file, it's kept fairly tight to
+   ;; avoid false matches.
+   ;;
+   ;; Many early gutenberg files have different first lines, but the
+   ;; alternatives here are enough for the non-ascii files existing in 2005.
+   ;;
+   ;; Some (but not all) utf-8 files begin with a marker sequence EF BB BF.
+
+   (and (looking-at "\\(...\\)?\\(Project Gutenberg\\('s\\)?\\|The Project Gutenberg\\|\\**This is a COPYRIGHTED Project Gutenberg\\) ")
+
+        ;; The regexp here is "^Cha[rt]acter set encoding: *\\(.*\\)", except
+        ;; tweaked to avoid trailing spaces and \r in the match-string.
+        ;;
+        ;; Project Gutenberg files are CRLF line endings (usually) so \r is
+        ;; normal; and trailing spaces have been seen in a few files.
+        ;;
+        ;; "Chatacter" is a typo seen in about 220 files as of 2005 (though
+        ;; only 38 are non-ASCII).
+        ;;
+        (re-search-forward
+         "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
+         ;; only search first 200 lines
+         (save-excursion (forward-line 200) (point))
+         t)
+
+        ;; The character set names are slightly free form.  They're perfectly
+        ;; understandable to a human, but need some massaging to get
+        ;; something `locale-charset-to-coding-system' can handle.  The stuff
+        ;; below was tested on the full set of files in 2005.
+        ;;
+        ;; Some readme.txt files have "MP3" or the like given as the
+        ;; character set, which is bogus, it refers to the existence of .mp3
+        ;; files, the .txt is plain ascii.  We let such cases get the warning
+        ;; message.
+
+        (let* ((orig-charset (match-string 1))
+               (charset (downcase orig-charset)))
+
+          ;; "ascii" -> "us-ascii"
+          ;; "iso-646-us (us-ascii)" -> "us-ascii"
+          (if (member charset '("ascii" "iso-646-us (us-ascii)"))
+              (setq charset "us-ascii"))
+
+          ;; "ascii, with a few iso-8859-1 characters" etc -> "iso-8859-1"
+          ;; "acii, with some iso-8859-1 characters" -> "iso-8859-1"
+          ;; the "acii" is a typo in dvptn10.txt, easy enough to allow it
+          (setq charset (replace-regexp-in-string
+                         "^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*"
+                         "\\1" charset t))
+
+          ;; "cp-1250" -> "windows-1250"
+          ;; "cp1251" -> "windows-1251"
+          ;; "codepage 1250" -> "windows-1250"
+          ;; "windows codepage 1252" -> "windows-1252"
+          ;; "windows code page 1252" -> "windows-1252"
+          (setq charset (replace-regexp-in-string
+                         "^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
+                         "windows-" charset t t))
+
+          ;; "unicode" alone -> "utf-8", found in 10752-8.txt
+          (setq charset (replace-regexp-in-string "^unicode\r?$" "utf-8"
+                                                  charset t))
+
+          ;; "unicode utf-8" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode utf" "utf"
+                                                  charset t t))
+
+          ;; "unicode (utf-8)" -> "utf-8"
+          (setq charset (replace-regexp-in-string "^unicode (\\(.*\\))$" "\\1"
+                                                  charset t))
+
+          ;; "iso-8858-1" -> "iso-8859-1", typo in 10439-8.txt
+          (setq charset (replace-regexp-in-string "8858" "8859" charset t t))
+
+          ;; "ido-8859-1" -> "iso-8859-1", typo in 10549-8.txt
+          (setq charset (replace-regexp-in-string "^ido-" "iso-" charset t t))
+
+          ;; "iso 8859-1 (latin-1)" -> "latin-1"
+          (setq charset (replace-regexp-in-string
+                         "^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$"
+                         "\\2" charset t))
+
+          ;; "iso=8859-1" -> "iso-8859-1"
+          ;; "big 5" -> "big-5"
+          (setq charset (replace-regexp-in-string "[= ]" "-" charset t t))
+
+          (or (locale-charset-to-coding-system charset)
+              (progn
+                (message "Warning: unknown coding system \"%s\""
+                         orig-charset)
+                nil)))))

  (defun sgml-xml-auto-coding-function (size)
    "Determine whether the buffer is XML, and if so, its encoding.
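
On the open question of matching the three marker bytes exactly rather than with "...", one possible approach is sketched below. It assumes an Emacs that provides `unibyte-string`, which may be newer than the CVS tree discussed in this thread; the `my-gutenberg-` names are hypothetical and the test harness (a unibyte temp buffer) is only a stand-in for the undecoded file contents.

```elisp
;; Sketch only: hypothetical names, assumes `unibyte-string' is available.

(defconst my-gutenberg-utf8-marker (unibyte-string #xEF #xBB #xBF)
  "The three marker bytes seen at the start of some utf-8 files.")

(defconst my-gutenberg-first-line-regexp
  (concat "\\(?:" (regexp-quote my-gutenberg-utf8-marker) "\\)?"
          "\\(?:Project Gutenberg\\(?:'s\\)?"
          "\\|The Project Gutenberg"
          "\\|\\**This is a COPYRIGHTED Project Gutenberg\\) ")
  "First-line test from the patch, with the marker bytes spelled out.")

(defun my-gutenberg-first-line-p (bytes)
  "Return non-nil if the unibyte string BYTES starts like a PG header."
  (with-temp-buffer
    (set-buffer-multibyte nil)      ; mimic a buffer of undecoded bytes
    (insert bytes)
    (goto-char (point-min))
    (looking-at my-gutenberg-first-line-regexp)))

;; Usage:
(my-gutenberg-first-line-p
 (concat my-gutenberg-utf8-marker "The Project Gutenberg EBook of Example"))
(my-gutenberg-first-line-p "My Project Gutenberg notes")
```

The first call should match and the second should not. With "\\(...\\)?" in place of the exact bytes, the second string would also be accepted, because "My " happens to be three characters long, so spelling out the bytes tightens the test a little further.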
* Re: gutenberg-coding.el -- coding system for Project Gutenberg files
  2005-10-25  2:11 ` Kenichi Handa
  2005-10-25  3:53   ` Stefan Monnier
@ 2005-10-25 20:27 ` Richard M. Stallman
  1 sibling, 0 replies; 10+ messages in thread
From: Richard M. Stallman @ 2005-10-25 20:27 UTC
  Cc: user42, emacs-devel

    I think that the code is good to be included in Emacs.  But,
    as it's not a bug fix, and I think not many people benefit
    from that (how many people read Gutenberg text file?).  So,
    I'd like to ask Richard to decide whether we include it now
    or postpone it.

It fixes something--whether to call it a bug or a gap is not clear to me.
And it seems harmless.  So there is no reason to wait for the release.

However, the issue of possible false matches that Stefan raised is
something we should look at first.
end of thread, newest message 2005-11-10 21:06 UTC

Thread overview: 10+ messages
     [not found] <87u0ff9387.fsf@zip.com.au>
     [not found] ` <E1ES3v6-0006dW-M6@fencepost.gnu.org>
     [not found]   ` <87wtk9uqcq.fsf@zip.com.au>
     [not found]     ` <E1ESdmi-0002fk-Pp@fencepost.gnu.org>
2005-10-24 22:33       ` gutenberg-coding.el -- coding system for Project Gutenberg files Kevin Ryde
2005-10-25  2:11         ` Kenichi Handa
2005-10-25  3:53           ` Stefan Monnier
2005-10-26  1:43             ` Kevin Ryde
2005-10-27  1:29               ` Richard M. Stallman
2005-10-27  2:04                 ` Kevin Ryde
2005-10-28  3:47                   ` Richard M. Stallman
2005-11-10 21:06                     ` Kevin Ryde
2005-10-27 23:57                 ` Kevin Ryde
2005-10-25 20:27           ` Richard M. Stallman