unofficial mirror of emacs-devel@gnu.org 
* po file charset via auto-coding-functions
@ 2005-10-20 21:06 Kevin Ryde
From: Kevin Ryde @ 2005-10-20 21:06 UTC



This is a proposal to get the coding system for a .po file via
auto-coding-functions, instead of the way textmodes/po.el reads the
file explicitly.

2005-10-20  Kevin Ryde  <user42@zip.com.au>

        * international/mule.el (po-content-type-charset-alist): Moved from
        textmodes/po.el, add "CHARSET" which is a placeholder from xgettext.
        (po-auto-coding-function): New function.  This gets the right coding
        system when visiting a .po via archive-mode; po-find-file-coding-system
        only worked on a normal file.  charset= regexp from textmodes/po.el.
        (auto-coding-functions): Use po-auto-coding-function.
        * international/mule-conf.el (file-coding-system-alist): Remove
        po-find-file-coding-system.
        * textmodes/po.el: Remove file, no longer used.
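
To illustrate what the new function looks for, here is a minimal sketch
that applies the same charset= regexp as the patch to a sample header.
The helper name `my-po-header-charset' and the sample text are made up
for this illustration; they are not part of the patch, and the sketch
skips the header-boundary checks that po-auto-coding-function performs.

```elisp
(defun my-po-header-charset (text)
  "Return the charset named in the po header found in TEXT, or nil.
Hypothetical helper for illustration; the regexp is the one in the patch."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; The header line ends with a literal backslash, `n' and a quote.
    (and (re-search-forward
          "^\"Content-Type:[ \t]*text/plain;[ \t]*charset=\\(.*\\)\\\\n\""
          nil t)
         (match-string 1))))

;; Sample header as xgettext-style tools write it (contents assumed).
(my-po-header-charset
 (concat "# Sample translation file\n"
         "msgid \"\"\n"
         "msgstr \"\"\n"
         "\"Project-Id-Version: demo 1.0\\n\"\n"
         "\"Content-Type: text/plain; charset=UTF-8\\n\"\n"))
```

The returned string is what the patch then looks up in
po-content-type-charset-alist and passes to
locale-charset-to-coding-system.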


One possible problem is that po files can have more than 1024 bytes of
comments before the header info block.  I see that Finsert_file_contents
in fileio.c only grabs 1024 bytes before calling set-auto-coding, but I
can't tell if or when that path is taken.  I think a normal visit or an
`archive-extract' has the whole file, so those cases work.
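
A rough way to see whether a given file would be affected is to measure
where its charset= header sits.  The sketch below is only an
illustration: both helper names are hypothetical, and the 1024 default
simply repeats the figure mentioned above rather than anything checked
against fileio.c.

```elisp
(defun my-po-charset-header-end (filename)
  "Return the byte offset just past the charset= text in FILENAME, or nil.
Hypothetical helper for illustration; not part of the patch."
  (with-temp-buffer
    ;; A literal insert does no decoding, so each byte of the file
    ;; occupies exactly one buffer position.
    (insert-file-contents-literally filename)
    (goto-char (point-min))
    (and (re-search-forward "^\"Content-Type:.*charset=" nil t)
         (- (point) (point-min)))))

(defun my-po-charset-beyond-limit-p (filename &optional limit)
  "Return non-nil if the charset= header of FILENAME lies past LIMIT bytes.
LIMIT defaults to 1024, the figure mentioned for Finsert_file_contents."
  (let ((end (my-po-charset-header-end filename)))
    (and end (> end (or limit 1024)))))
```

Mapping `my-po-charset-beyond-limit-p' over a list of .po file names
would show which ones keep their header beyond the first 1024 bytes.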


I used the following bit of code to exercise po-auto-coding-function
on all my .po files.  The function prints messages about bad charsets,
and the result is a list of the files it could not handle.

(delq nil
      (mapcar (lambda (filename)
                (with-temp-buffer
                  (insert-file-contents-literally filename)
                  (goto-char (point-min))
                  (if (po-auto-coding-function (- (point-max) (point-min)))
                      nil
                    filename)))
              (delete "" (split-string
                          (shell-command-to-string "locate \\*.po") "\n"))))

Among my files I found two unrecognised charsets:

"TCVN-5712" in the gtk 1.2 Vietnamese translation.  Is there a good
place to map or alias that to `tcvn', which Emacs knows?

"iso-8859-9e" in the gtk 1.2 Azerbaijani Turkish translation, but I
don't know what that charset is or is meant to be.  glibc iconv doesn't
seem to recognise it, so presumably it's unused.
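
For the TCVN-5712 case, one possible approach is an explicit alist
entry consulted before the general lookup, in the same way the patch
consults po-content-type-charset-alist.  This is only a sketch of that
idea: the variable and function names are hypothetical, and whether
`tcvn' is the right target is taken from the question above, not
verified here.

```elisp
(defvar my-po-extra-charset-alist
  '(("TCVN-5712" . tcvn))
  "Hypothetical extra mappings from po charset names to coding systems.
The `tcvn' target is an assumption taken from the message text.")

(defun my-po-lookup-extra-charset (charset)
  "Return the coding system mapped to CHARSET, ignoring case, or nil."
  ;; The third argument of `assoc-string' requests case folding, so
  ;; \"tcvn-5712\" and \"TCVN-5712\" are treated alike.
  (cdr (assoc-string charset my-po-extra-charset-alist t)))

(my-po-lookup-extra-charset "tcvn-5712")
```

An entry like this could equally be added to
po-content-type-charset-alist itself; "iso-8859-9e" is left unmapped
since its meaning is unknown.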



[-- Attachment #2: mule.el.po-coding.diff --]
[-- Type: text/plain, Size: 3146 bytes --]

*** mule.el.~1.226.~	2005-09-29 09:23:59.000000000 +1000
--- mule.el	2005-10-21 06:54:07.785993736 +1000
***************
*** 1586,1592 ****
  		       (symbol :tag "Coding system"))))
  
  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(sgml-xml-auto-coding-function
  				   sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.
  
--- 1586,1593 ----
  		       (symbol :tag "Coding system"))))
  
  ;; See the bottom of this file for built-in auto coding functions.
! (defcustom auto-coding-functions '(po-auto-coding-function
!                                    sgml-xml-auto-coding-function
  				   sgml-html-meta-auto-coding-function)
    "A list of functions which attempt to determine a coding system.
  
***************
*** 2203,2208 ****
--- 2204,2254 ----
  
  ;;; Built-in auto-coding-functions:
  
+ (defconst po-content-type-charset-alist
+   '(("ASCII" . undecided)
+     ("ANSI_X3.4-1968" . undecided)
+     ("US-ASCII" . undecided)
+     ;; "charset=CHARSET" is generated by xgettext, and may be present before
+     ;; someone fills in their target charset.  `undecided' should be right.
+     ("CHARSET" . undecided))
+   "Alist mapping GNU libc/libiconv canonical charset names to coding systems.
+ Contains canonical charset names that don't correspond to coding systems.")
+ 
+ (defun po-auto-coding-function (size)
+   "Determine character encoding of a gettext .po or .pot file.
+ This function is designed for use in `auto-coding-functions'.
+ 
+ A po file starts with msgid \"\" and msgstr \"\", the latter holding header
+ information, in particular \"Content-Type: text/plain; charset=ASCII\\n\"."
+ 
+   ;; Skip "#" comment lines and whitespace-only lines, then want
+   ;;     msgid ""
+   ;;     msgstr ""
+   ;; which, up to the next blank line, is the header info.
+   ;;
+   (and (looking-at
+         "\\(#.*\n\\|[ \t\r]*\n\\)*msgid \"\"[ \t\r]*\nmsgstr \"\"[ \t\r]*\n")
+        (let ((limit (+ (point) size)))
+          (save-excursion
+            (goto-char (match-end 0))
+ 
+            ;; Blank line is the end of the header, stop searching there, or
+            ;; at existing `limit' if the file is a header only.
+            (setq limit (or (save-excursion
+                              (re-search-forward "\n[ \t\r]*\n" limit t))
+                            limit))
+ 
+            (and (re-search-forward "^\"Content-Type:[ \t]*text/plain;[ \t]*charset=\\(.*\\)\\\\n\"" limit t)
+                 (let ((charset (match-string 1)))
+ 
+                   (or (cdr (assoc-string charset po-content-type-charset-alist
+                                          t))
+                       (locale-charset-to-coding-system charset)
+                       (progn
+                         (message "Warning: unknown coding system \"%s\""
+                                  charset)
+                         nil))))))))
+ 
  (defun sgml-xml-auto-coding-function (size)
    "Determine whether the buffer is XML, and if so, its encoding.
  This function is intended to be added to `auto-coding-functions'."

[-- Attachment #3: mule-conf.el.po-coding.diff --]
[-- Type: text/plain, Size: 493 bytes --]

*** mule-conf.el.~1.82.~	2005-08-04 06:28:21.000000000 +1000
--- mule-conf.el	2005-10-20 17:45:02.000000000 +1000
***************
*** 502,508 ****
  	;; the beginning of a doc string, work.
  	("\\(\\`\\|/\\)loaddefs.el\\'" . (raw-text . raw-text-unix))
  	("\\.tar\\'" . (no-conversion . no-conversion))
- 	( "\\.po[tx]?\\'\\|\\.po\\." . po-find-file-coding-system)
  	("\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'" . latexenc-find-file-coding-system)
  	("" . (undecided . nil))))
  
--- 502,507 ----
