unofficial mirror of emacs-devel@gnu.org 
* Handling invalid HTML
@ 2005-10-18  8:06 Juri Linkov
  2005-10-19  2:43 ` Richard M. Stallman
From: Juri Linkov @ 2005-10-18  8:06 UTC


The current rules for recognizing HTML files in Emacs are too strict:

1. The valid string delimiter for HTML attribute values is the
quotation character.  However, some HTML files on the Web use
apostrophes, e.g.

<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>

The program that generates such non-standard meta headers is identified
as 'Microsoft DHTML Editing Control' (no surprise).

`sgml-html-meta-auto-coding-function' can't determine the encoding
from such invalid meta headers.  I propose replacing \" with [\"']
in the regexps in `sgml-html-meta-auto-coding-function' so that it
accepts such invalid HTML.  (The regexps in the other function,
`sgml-xml-auto-coding-function', already match [\"'] for XML files.)
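
The effect of that replacement can be sketched in isolation.  This is
only an illustration: the two patterns below are simplified to the
`content' part of the meta header, and every `my-' name is
hypothetical, not something defined in Emacs.

```elisp
;; Hypothetical, simplified patterns covering only the content attribute.
(defconst my-charset-re-strict
  "content=\"text/\\sw+;\\s-*charset=\\(.+?\\)\""
  "Accepts only double quotes around the content value.")

(defconst my-charset-re-relaxed
  "content=[\"']text/\\sw+;\\s-*charset=\\(.+?\\)[\"']"
  "Accepts double quotes or apostrophes around the content value.")

(defconst my-sample
  "<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>"
  "The apostrophe-delimited header from the message above.")

(defun my-charset-in (regexp string)
  "Return the charset captured by REGEXP in STRING, or nil if no match."
  (when (string-match regexp string)
    (match-string 1 string)))

(my-charset-in my-charset-re-strict my-sample)   ; => nil
(my-charset-in my-charset-re-relaxed my-sample)  ; => "UTF-8"
```

Note that the relaxed pattern does not require the opening and closing
delimiters to be the same character.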

2. `sgml-html-meta-auto-coding-function' can't determine the encoding
when an HTML file has no `<html>' start tag.  An example of such a
file is the Mozilla Firefox bookmark file.  Sometimes one needs to
open this file in Emacs and use isearch on it, but Emacs can't
detect its encoding.  Perhaps the test `(search-forward "<html" size t)'
should be removed from `sgml-html-meta-auto-coding-function'.
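
A rough sketch of why the `<html' test gets in the way.  The sample
text is an assumed, abridged beginning of a bookmark file, and the
`my-' names are hypothetical; the regexp is reduced to the
double-quoted form for brevity.

```elisp
(defconst my-meta-re
  "<meta\\s-+http-equiv=\"content-type\"\\s-+content=\"text/\\sw+;\\s-*charset=\\(.+?\\)\""
  "Meta header pattern that accepts double-quoted values only.")

(defconst my-bookmark-head
  (concat "<!DOCTYPE NETSCAPE-Bookmark-file-1>\n"
          "<META HTTP-EQUIV=\"Content-Type\""
          " CONTENT=\"text/html; charset=UTF-8\">\n"
          "<TITLE>Bookmarks</TITLE>\n")
  "Assumed start of a bookmark file; note that it has no <html> element.")

(defun my-find-charset (require-html-tag)
  "Return the meta charset found in the current buffer, or nil.
If REQUIRE-HTML-TAG is non-nil, insist on finding \"<html\" first."
  (goto-char (point-min))
  (let ((case-fold-search t))           ; match <META ...> as well as <meta ...>
    (when (and (or (not require-html-tag)
                   (search-forward "<html" nil t))
               (re-search-forward my-meta-re nil t))
      (match-string 1))))

(defun my-demo ()
  "Return the charset found with and without the \"<html\" test."
  (with-temp-buffer
    (insert my-bookmark-head)
    (list (my-find-charset t)
          (my-find-charset nil))))

(my-demo)  ; => (nil "UTF-8")
```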

3. When visiting a Mozilla Firefox bookmark file, Emacs also fails to
detect the type of the file.  It opens the file in SGML mode, whereas
it is actually an HTML file.  This problem is caused by the default
value of `magic-mode-alist'.  Maybe the `.html' extension in
`auto-mode-alist' should take precedence over `magic-mode-alist'?

-- 
Juri Linkov
http://www.jurta.org/emacs/


* Handling invalid HTML
       [not found] <E1ERnvt-0006VR-9S@monty-python.gnu.org>
@ 2005-10-18 15:05 ` Jonathan Yavner
  2005-10-19 15:59   ` Juri Linkov
From: Jonathan Yavner @ 2005-10-18 15:05 UTC


Juri Linkov writes:
> 1. The valid string delimiter for HTML attribute values is the
> quotation character.  However, some HTML files on the Web use
> apostrophes, e.g.
>    <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>
> The program that generates such non-standard meta headers is
> identified as 'Microsoft DHTML Editing Control' (no surprise).

http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
    "By default, SGML requires that all attribute values be delimited
    using either double quotation marks (ASCII decimal 34) or single
    quotation marks (ASCII decimal 39). ... In certain cases, authors
    may specify the value of an attribute without any quotation marks."

In XHTML the no-marks case was eliminated, but the use of 'apostrophes' 
is still valid.  There are many complaints one can make about 
Microsoft, but this isn't one of them.

--Jonathan


* Re: Handling invalid HTML
  2005-10-18  8:06 Handling invalid HTML Juri Linkov
@ 2005-10-19  2:43 ` Richard M. Stallman
From: Richard M. Stallman @ 2005-10-19  2:43 UTC
  Cc: emacs-devel

    3. When visiting a Mozilla Firefox bookmark file, Emacs also fails to
    detect the type of the file.  It opens the file in SGML mode, whereas
    it is actually an HTML file.  This problem is caused by the default
    value of `magic-mode-alist'.  Maybe the `.html' extension in
    `auto-mode-alist' should take precedence over `magic-mode-alist'?

That would be the tail wagging the dog.

If there is a suitable criterion to test, one that would give the right
results, then whatever function is called through magic-mode-alist can
test that criterion.
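
One way such a criterion could look, as a sketch only.  It assumes
that bookmark files begin with the doctype line shown, and that
`magic-mode-alist' accepts (MATCH-FUNCTION . MODE) elements, as it
does in current Emacs; this may not match the code under discussion
at the time.  `my-bookmark-file-p' is a hypothetical name.

```elisp
(defun my-bookmark-file-p ()
  "Return non-nil if the current buffer starts with the bookmark doctype."
  (save-excursion
    (goto-char (point-min))
    (looking-at "<!DOCTYPE NETSCAPE-Bookmark-file-1>")))

;; Ask for HTML mode whenever the criterion holds, whatever the
;; extension-based rules in `auto-mode-alist' would have chosen.
(add-to-list 'magic-mode-alist '(my-bookmark-file-p . html-mode))
```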


* Re: Handling invalid HTML
  2005-10-18 15:05 ` Jonathan Yavner
@ 2005-10-19 15:59   ` Juri Linkov
  2005-10-20  4:54     ` Richard M. Stallman
From: Juri Linkov @ 2005-10-19 15:59 UTC
  Cc: emacs-devel

> http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
>     "By default, SGML requires that all attribute values be delimited
>     using either double quotation marks (ASCII decimal 34) or single
>     quotation marks (ASCII decimal 39). ... In certain cases, authors
>     may specify the value of an attribute without any quotation marks."
>
> In XHTML the no-marks case was eliminated, but the use of 'apostrophes' 
> is still valid.  There are many complaints one can make about 
> Microsoft, but this isn't one of them.

I still see no reason for them to generate HTML files with such an
uncommon syntax, other than to make users' lives harder.

Anyway, the following patch will allow Emacs to recognize the encoding
with either kind of quotation mark (and around the `content-type'
value, quotation marks are optional):

Index: lisp/international/mule.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
retrieving revision 1.226
diff -c -r1.226 mule.el
*** lisp/international/mule.el	24 Sep 2005 13:43:59 -0000	1.226
--- lisp/international/mule.el	19 Oct 2005 15:57:28 -0000
***************
*** 2229,2242 ****
  		  (save-excursion
  		    (forward-line 10)
  		    (point))))
!   (when (and (search-forward "<html" size t)
! 	     (re-search-forward "<meta\\s-+http-equiv=\"content-type\"\\s-+content=\"text/\\sw+;\\s-*charset=\\(.+?\\)\"" size t))
!       (let* ((match (match-string 1))
! 	     (sym (intern (downcase match))))
! 	(if (coding-system-p sym)
! 	    sym
! 	  (message "Warning: unknown coding system \"%s\"" match)
! 	  nil))))
  
  ;;;
  (provide 'mule)
--- 2233,2245 ----
  		  (save-excursion
  		    (forward-line 10)
  		    (point))))
!   (when (re-search-forward "<meta\\s-+http-equiv=[\"']?content-type[\"']?\\s-+content=[\"']text/\\sw+;\\s-*charset=\\(.+?\\)[\"']" size t)
!     (let* ((match (match-string 1))
! 	   (sym (intern (downcase match))))
!       (if (coding-system-p sym)
! 	  sym
! 	(message "Warning: unknown coding system \"%s\"" match)
! 	nil))))
  
  ;;;
  (provide 'mule)
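
For trying the patched regexp outside of mule.el, here is a
self-contained sketch.  The function body follows the new side of the
diff; the names `my-html-meta-coding' and `my-coding-of' are
hypothetical, and `case-fold-search' is bound explicitly only to make
the example independent of local settings.

```elisp
(defun my-html-meta-coding (size)
  "Return the coding system named in a meta header after point, or nil.
Look at most SIZE characters ahead, and no further than ten lines."
  (let ((limit (min (+ (point) size)
                    (save-excursion
                      (forward-line 10)
                      (point)))))
    (when (re-search-forward
           "<meta\\s-+http-equiv=[\"']?content-type[\"']?\\s-+content=[\"']text/\\sw+;\\s-*charset=\\(.+?\\)[\"']"
           limit t)
      (let* ((match (match-string 1))
             (sym (intern (downcase match))))
        (if (coding-system-p sym)
            sym
          (message "Warning: unknown coding system \"%s\"" match)
          nil)))))

(defun my-coding-of (text)
  "Run `my-html-meta-coding' on TEXT in a temporary buffer."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((case-fold-search t))
      (my-html-meta-coding (buffer-size)))))

;; Apostrophes around both attribute values.
(my-coding-of
 "<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>")
;; => utf-8

;; No delimiters around Content-Type, double quotes around the content value.
(my-coding-of
 "<META HTTP-EQUIV=Content-Type CONTENT=\"text/html; charset=iso-8859-1\">")
;; => iso-8859-1
```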

-- 
Juri Linkov
http://www.jurta.org/emacs/


* Re: Handling invalid HTML
  2005-10-19 15:59   ` Juri Linkov
@ 2005-10-20  4:54     ` Richard M. Stallman
From: Richard M. Stallman @ 2005-10-20  4:54 UTC
  Cc: jyavner, emacs-devel

    Anyway, the following patch will allow Emacs to recognize the encoding
    with either kind of quotation mark (and around the `content-type'
    value, quotation marks are optional):

If no one objects in the next week, would you please install it?
