From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juri Linkov
Newsgroups: gmane.emacs.devel
Subject: Handling invalid HTML
Date: Tue, 18 Oct 2005 11:06:42 +0300
Organization: JURTA
Message-ID: <87br1ni7gl.fsf@jurta.org>

The current rules for recognizing HTML files in Emacs are too strict:

1. The valid string delimiter for HTML attribute values is the
quotation character.  However, some HTML files on the Web use
apostrophes, e.g.

The program that generates such non-standard meta headers is
identified as 'Microsoft DHTML Editing Control' (no surprise).

`sgml-html-meta-auto-coding-function' can't determine the encoding
from such invalid meta headers.  I propose to replace \" with [\"']
in the regexps in `sgml-html-meta-auto-coding-function' to accept
such invalid HTML.  (The regexps in the other function,
`sgml-xml-auto-coding-function', already match [\"'] for XML files.)

2. `sgml-html-meta-auto-coding-function' can't determine the encoding
when an HTML file has no `<html>' starting element.  An example of
such an HTML file is the Mozilla Firefox bookmark file.  Sometimes
one needs to open this file in Emacs and use isearch on it, but
Emacs can't detect its encoding.
Perhaps the test `(search-forward "

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jonathan Yavner
Newsgroups: gmane.emacs.devel
Subject: Handling invalid HTML
Date: Tue, 18 Oct 2005 11:05:55 -0400
Message-ID: <200510181105.56063.jyavner@member.fsf.org>

Juri Linkov writes:
> 1. The valid string delimiter for HTML attribute values is the
> quotation character.  However, some HTML files on the Web use
> apostrophes, e.g.
>
> The program that generates such non-standard meta headers is
> identified as 'Microsoft DHTML Editing Control' (no surprise).

http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
   "By default, SGML requires that all attribute values be delimited
   using either double quotation marks (ASCII decimal 34) or single
   quotation marks (ASCII decimal 39). ... In certain cases, authors
   may specify the value of an attribute without any quotation marks."

In XHTML the no-marks case was eliminated, but the use of 'apostrophes'
is still valid.  There are many complaints one can make about
Microsoft, but this isn't one of them.

--Jonathan

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Richard M. Stallman"
Newsgroups: gmane.emacs.devel
Subject: Re: Handling invalid HTML
Date: Tue, 18 Oct 2005 22:43:57 -0400
References: <87br1ni7gl.fsf@jurta.org>
Reply-To: rms@gnu.org
In-reply-to: <87br1ni7gl.fsf@jurta.org> (message from Juri Linkov on Tue, 18 Oct 2005 11:06:42 +0300)

    3. Visiting Mozilla Firefox bookmark file in Emacs also can't detect
    the type of this file.  Emacs opens it in SGML mode whereas it is
    actually HTML file.  This problem is caused by the default value of
    `magic-mode-alist'.  Maybe the `.html' extension in `auto-mode-alist'
    should take precedence over `magic-mode-alist'?

That would be the tail wagging the dog.  If there is a suitable
criterion to test, that would give the right results, whatever
function is called through magic-mode-alist can test that criterion.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juri Linkov
Newsgroups: gmane.emacs.devel
Subject: Re: Handling invalid HTML
Date: Wed, 19 Oct 2005 18:59:14 +0300
Organization: JURTA
Message-ID: <877jc9h5hi.fsf@jurta.org>
References: <200510181105.56063.jyavner@member.fsf.org>
In-Reply-To: <200510181105.56063.jyavner@member.fsf.org> (Jonathan Yavner's message of "Tue, 18 Oct 2005 11:05:55 -0400")

> http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
>    "By default, SGML requires that all attribute values be delimited
>    using either double quotation marks (ASCII decimal 34) or single
>    quotation marks (ASCII decimal 39). ... In certain cases, authors
>    may specify the value of an attribute without any quotation marks."
>
> In XHTML the no-marks case was eliminated, but the use of 'apostrophes'
> is still valid.  There are many complaints one can make about
> Microsoft, but this isn't one of them.

I still see no reason for them to generate HTML files with such an
uncommon syntax, if not for making the life of users harder.

Anyway, the following patch will allow Emacs to recognize encoding
with either quotation marks (and for the attribute `content-type'
quotation marks are optional):

Index: lisp/international/mule.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
retrieving revision 1.226
diff -c -r1.226 mule.el
*** lisp/international/mule.el	24 Sep 2005 13:43:59 -0000	1.226
--- lisp/international/mule.el	19 Oct 2005 15:57:28 -0000
***************
*** 2229,2242 ****
                      (save-excursion
                        (forward-line 10)
                        (point))))
!       (when (and (search-forward "

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Richard M. Stallman"
Newsgroups: gmane.emacs.devel
Subject: Re: Handling invalid HTML
Date: Thu, 20 Oct 2005 00:54:32 -0400
References: <200510181105.56063.jyavner@member.fsf.org> <877jc9h5hi.fsf@jurta.org>
Reply-To: rms@gnu.org
In-reply-to: <877jc9h5hi.fsf@jurta.org> (message from Juri Linkov on Wed, 19 Oct 2005 18:59:14 +0300)

    Anyway, the following patch will allow Emacs to recognize encoding
    with either quotation marks (and for the attribute `content-type'
    quotation marks are optional):

If no one objects in the next week, would you please install it?
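
The quoting rules discussed in this thread (double quotes or
apostrophes around attribute values, and optional quotes around
`content-type') can be sketched as a single regexp.  What follows is a
minimal Emacs Lisp sketch under stated assumptions, not the patch to
mule.el: the function name `my-html-meta-charset' is hypothetical, it
takes a string rather than scanning the visited buffer, and the regexp
is only one possible way to express the rule.

```elisp
;; Hypothetical helper for illustration only; this is NOT the code of
;; `sgml-html-meta-auto-coding-function' in mule.el.
(defun my-html-meta-charset (text)
  "Return the charset named in an HTML meta header in TEXT, or nil.
Attribute values may be delimited by double quotes or apostrophes,
and the quotes around `content-type' are optional."
  (let ((case-fold-search t))           ; match Content-Type, CONTENT-TYPE, ...
    (when (string-match
           (concat
            ;; http-equiv value: optional quote of either kind
            "<meta[ \t\n]+http-equiv=[\"']?content-type[\"']?"
            ;; content value: opening quote of either kind
            "[ \t\n]+content=[\"']text/[a-z]+;[ \t\n]*"
            ;; charset name runs up to the closing quote or `>'
            "charset=\\([^\"'>]+\\)")
           text)
      (match-string 1 text))))
```

With this sketch, a header written with apostrophes such as
<meta http-equiv='Content-Type' content='text/html; charset=windows-1252'>
yields "windows-1252", the same header written with double quotes
yields the same charset name, and a string without a meta header
yields nil.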