From mboxrd@z Thu Jan 1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Colin Walters
Newsgroups: gmane.emacs.devel
Subject: auto-detecting encoding for XML
Date: 18 May 2002 22:27:51 -0400
Sender: emacs-devel-admin@gnu.org
Message-ID: <1021775271.29752.2282.camel@space-ghost>
NNTP-Posting-Host: localhost.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-HsYSTSVNqq8swm1l0NZJ"
X-Trace: main.gmane.org 1021835535 21691 127.0.0.1 (19 May 2002 19:12:15 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Sun, 19 May 2002 19:12:15 +0000 (UTC)
Return-path:
Original-Received: from quimby.gnus.org ([80.91.224.244]) by main.gmane.org with esmtp (Exim 3.33 #1 (Debian)) id 179W6J-0005dk-00 for ; Sun, 19 May 2002 21:12:15 +0200
Original-Received: from fencepost.gnu.org ([199.232.76.164]) by quimby.gnus.org with esmtp (Exim 3.12 #1 (Debian)) id 179WJu-00087d-00 for ; Sun, 19 May 2002 21:26:18 +0200
Original-Received: from localhost ([127.0.0.1] helo=fencepost.gnu.org) by fencepost.gnu.org with esmtp (Exim 3.34 #1 (Debian)) id 179W6H-0008G5-00; Sun, 19 May 2002 15:12:13 -0400
Original-Received: from monk.debian.net ([216.185.54.61] helo=monk.verbum.org) by fencepost.gnu.org with esmtp (Exim 3.34 #1 (Debian)) id 179W5S-00086Y-00 for ; Sun, 19 May 2002 15:11:22 -0400
Original-Received: from space-ghost.verbum.private (freedom.cis.ohio-state.edu [164.107.60.183]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (Client CN "space-ghost.verbum.org", Issuer "monk.verbum.org" (verified OK)) by monk.verbum.org (Postfix (Debian/GNU)) with ESMTP id 5BDC57400274 for ; Sun, 19 May 2002 15:11:06 -0400 (EDT)
Original-Received: by space-ghost.verbum.private (Postfix (Debian/GNU), from userid 1000) id 0FA3B806B92; Sat, 18 May 2002 22:27:51 -0400 (EDT)
Original-To: emacs-devel@gnu.org
X-Mailer: Ximian Evolution 1.0.3
Errors-To: emacs-devel-admin@gnu.org
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.0.9
Precedence: bulk
List-Help:
List-Post:
List-Subscribe: ,
List-Id: Emacs development discussions.
List-Unsubscribe: ,
List-Archive:
Xref: main.gmane.org gmane.emacs.devel:4128
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:4128

--=-HsYSTSVNqq8swm1l0NZJ
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

Hi,

Someone asked on the xml-resume-devel list about the problems they'd
had because Emacs decided that their XML file (encoded in utf-8) should
be encoded using `raw-text'.

XML has an optional encoding="foo" parameter, used like:

Wouldn't it be nice if Emacs could see that and automatically Do The
Right Thing?  The attached patch is an implementation.

The reason why I had to put `sgml-xml-auto-coding-function' inside
mule.el is because 1) we can't autoload it from sgml-mode.el, since
that would be more or less equivalent to always loading sgml-mode.el,
and 2) If we added it to `auto-coding-functions' when sgml-mode.el was
loaded, then it wouldn't work the first time a user visited an XML
file, since the encoding has already been determined.  (Subsequently
it would work, though).

2002-05-18  Colin Walters

        * international/mule.el (make-coding-system): Doc fixes.

        * international/mule.el (auto-coding-functions): New variable.
        (auto-coding-from-file-contents): Use it.
        (set-auto-coding): Update docs.
        (sgml-xml-auto-coding-function): New function.

--=-HsYSTSVNqq8swm1l0NZJ
Content-Disposition: attachment; filename=mule.patch
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; name=mule.patch; charset=ISO-8859-1

--- man/mule.texi.~1.60.~ Fri Apr 5 04:37:54 2002
+++ man/mule.texi Sat May 18 21:35:48 2002
@@ -793,17 +793,19 @@
 
 @vindex auto-coding-alist
 @vindex auto-coding-regexp-alist
-  The variables @code{auto-coding-alist} and
-@code{auto-coding-regexp-alist} are the strongest way to specify the
-coding system for certain patterns of file names, or for files
-containing certain patterns; these variables even override
-@samp{-*-coding:-*-} tags in the file itself.  Emacs uses
-@code{auto-coding-alist} for tar and archive files, to prevent it
+@vindex auto-coding-functions
+  The variables @code{auto-coding-alist},
+@code{auto-coding-regexp-alist} and @code{auto-coding-functions} are
+the strongest way to specify the coding system for certain patterns of
+file names, or for files containing certain patterns; these variables
+even override @samp{-*-coding:-*-} tags in the file itself.  Emacs
+uses @code{auto-coding-alist} for tar and archive files, to prevent it
 from being confused by a @samp{-*-coding:-*-} tag in a member of the
 archive and thinking it applies to the archive file as a whole.
 Likewise, Emacs uses @code{auto-coding-regexp-alist} to ensure that
-RMAIL files, whose names in general don't match any particular pattern,
-are decoded correctly.
+RMAIL files, whose names in general don't match any particular
+pattern, are decoded correctly.  One of the builtin
+@code{auto-coding-functions} detects the encoding for XML files.
 
   If Emacs recognizes the encoding of a file incorrectly, you can
 reread the file using the correct coding system by typing @kbd{C-x
--- lisp/international/mule.el.~1.141.~ Tue Feb 26 11:27:47 2002
+++ lisp/international/mule.el Sat May 18 22:16:00 2002
@@ -725,9 +725,9 @@
 
 TYPE is an integer value indicating the type of the coding system as follows:
   0: Emacs internal format,
-  1: Shift-JIS (or MS-Kanji) used mainly on Japanese PC,
+  1: Shift-JIS (or MS-Kanji) used mainly on Japanese PCs,
   2: ISO-2022 including many variants,
-  3: Big5 used mainly on Chinese PC,
+  3: Big5 used mainly on Chinese PCs,
   4: private, CCL programs provide encoding/decoding algorithm,
   5: Raw-text, which means that text contains random 8-bit codes.
 
@@ -822,7 +822,7 @@
 
   o mime-charset
 
-        The value is a symbol of which name is `MIME-charset' parameter of
+        The value is a symbol whose name is the `MIME-charset' parameter of
         the coding system.
 
   o valid-codes (meaningful only for a coding system based on CCL)
@@ -1488,6 +1488,22 @@
   :type '(repeat (cons (regexp :tag "Regexp")
                        (symbol :tag "Coding system"))))
 
+;; See the bottom of this file for built-in auto coding functions.
+(defcustom auto-coding-functions '(sgml-xml-auto-coding-function)
+  "A list of functions which attempt to determine a coding system.
+
+Each function in this list should be written to operate on the current
+buffer, but should not modify it in any way.  It should take one
+argument SIZE, past which it should not search.  If a function
+succeeds in determining a coding system, it should return that coding
+system.  Otherwise, it should return nil.
+
+The functions in this list take priority over `coding:' tags in the
+file, just as for `auto-coding-regexp-alist'."
+  :group 'files
+  :group 'mule
+  :type '(repeat function))
+
 (defvar set-auto-coding-for-load nil
   "Non-nil means look for `load-coding' property instead of `coding'.
 This is used for loading and byte-compiling Emacs Lisp files.")
@@ -1503,21 +1519,25 @@
         (setq alist (cdr alist))))
     coding-system))
 
-
 (defun auto-coding-from-file-contents (size)
   "Determine a coding system from the contents of the current buffer.
 The current buffer contains SIZE bytes starting at point.
 Value is either a coding system or nil."
   (save-excursion
     (let ((alist auto-coding-regexp-alist)
+          (funcs auto-coding-functions)
           coding-system)
       (while (and alist (not coding-system))
         (let ((regexp (car (car alist))))
           (when (re-search-forward regexp (+ (point) size) t)
             (setq coding-system (cdr (car alist)))))
         (setq alist (cdr alist)))
+      (while (and funcs (not coding-system))
+        (setq coding-system (condition-case e
+                                (save-excursion
+                                  (funcall (pop funcs) size))
+                              (error nil))))
       coding-system)))
-	
 
 (defun set-auto-coding (filename size)
   "Return coding system for a file FILENAME of which SIZE bytes follow point.
@@ -1527,7 +1547,8 @@
 It checks FILENAME against the variable `auto-coding-alist'.  If
 FILENAME doesn't match any entries in the variable, it checks the
 contents of the current buffer following point against
-`auto-coding-regexp-alist'.  If no match is found, it checks for a
+`auto-coding-regexp-alist', and tries calling each function in
+`auto-coding-functions'.  If no match is found, it checks for a
 `coding:' tag in the first one or two lines following point.  If no
 `coding:' tag is found, it checks for local variables list in the last
 3K bytes out of the SIZE bytes.
@@ -1896,6 +1917,28 @@
 (put 'ignore-relative-composition 'char-table-extra-slots 0)
 (setq ignore-relative-composition
       (make-char-table 'ignore-relative-composition))
+
+
+;;; Built-in auto-coding-functions:
+
+(defun sgml-xml-auto-coding-function (size)
+  "Determine whether the buffer is XML, and if so, its encoding.
+This function is intended to be added to `auto-coding-functions'."
+  (when (re-search-forward "\\s-*<\\?xml" size t)
+    (let ((end (save-excursion
+                 ;; This is a hack.
+                 (search-forward "?>" size t))))
+      (when end
+        (if (re-search-forward "encoding=\"\\(.+?\\)\"" end t)
+            (let ((match (downcase (match-string 1))))
+              ;; FIXME: what other encodings are valid, and how can we
+              ;; translate them to the names of coding systems?
+              (cond ((string= match "utf-8")
+                     'utf-8)
+                    ((string-match "iso-8859-[[:digit:]]+" match)
+                     (intern match))
+                    (t nil)))
+          'utf-8)))))
 
 ;;;
 (provide 'mule)

--=-HsYSTSVNqq8swm1l0NZJ--
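
The FIXME in `sgml-xml-auto-coding-function' asks how other encoding names could be translated to coding systems. One possible approach, offered only as a sketch and not as part of the patch, is to intern the lower-cased name and accept it only when `coding-system-p' recognizes it. The helper name `my-xml-encoding-to-coding-system' is hypothetical.

```elisp
;; Hypothetical helper, not part of the patch above.
(defun my-xml-encoding-to-coding-system (name)
  "Return the coding system whose name matches the XML encoding NAME.
Return nil if Emacs defines no coding system of that name."
  (let ((sym (intern (downcase name))))
    ;; `coding-system-p' also returns t when given nil; in that case
    ;; `and' still yields nil, which is the "no coding system" result.
    (and (coding-system-p sym) sym)))
```

With a helper like this, the `cond' in the patch could reduce to a single call on (match-string 1). Encoding names that have no identically named coding system in Emacs would still need an explicit alias table.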