Re: xml-lite.el

From: Felix Natter <fnatter@gmx.net>
Cc: Mike Williams <mdub@bigfoot.com>,
	emacs-devel@gnu.org, Sam Steingold <sds@gnu.org>,
	 Eli Zaretskii <eliz@is.elta.co.il>,
	keichwa@gmx.net (Karl Eichwalder)
Subject: Re: xml-lite.el
Date: 08 Mar 2002 22:41:58 +0100
Message-ID: <87pu2e4tzt.fsf@gmx.net>
In-Reply-To: <200203072345.g27NjKh17379@rum.cs.yale.edu>

"Stefan Monnier" <monnier+gnu/emacs@RUM.cs.yale.edu> writes:
> > hi,
> > 
> > here's what we have to do to spice xml-lite up for sgml:
> > 
> > - allow Different tag-names

The regular expressions in `xml-lite-parse-tag-name' need to be changed
for SGML compatibility.

Maybe Karl can help us out: which characters are allowed at the beginning
of a (normal) tag name, and which characters are allowed after the
first one?
Alternatively, we can change this to name only the characters that may
_not_ appear in the tag name ((re-search-forward "[^ />\"']*" nil t)).
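
As a rough illustration of the "excluded characters" approach, the sketch
below reads a tag name by skipping everything up to the first delimiter.
The function name `my-sgml-parse-tag-name' is made up for this example, and
adding tab and newline to the excluded set is my assumption; which
characters SGML really permits is exactly the open question above.

```elisp
(defun my-sgml-parse-tag-name ()
  "Return the tag name starting at point and move past it, or nil.
Hypothetical helper: a name is any run of characters other than
space, tab, newline, `/', `>' and the two quote characters."
  (let ((start (point)))
    ;; A leading ^ in a skip-chars string negates the set.
    (skip-chars-forward "^ \t\n/>\"'")
    (when (> (point) start)
      (buffer-substring-no-properties start (point)))))

;; Usage: parse the name of a tag whose "<" sits at the start of the buffer.
(with-temp-buffer
  (insert "<Foo.Bar-1 x=y>")
  (goto-char 2)                 ; just after the "<"
  (my-sgml-parse-tag-name))     ; returns "Foo.Bar-1"
```

This accepts more than any real name syntax would, but it never has to be
updated when a new name character turns up.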

> > - replace (setq tag-end (search-forward ">" limit t))

All occurrences of (search-forward ">") need to be changed. I suggest
putting this in a function of its own, so that we can simply replace the
above with (sgml-end-of-tag)
(we can use a simple re-search-forward, as suggested below).

Furthermore, I am not sure whether '<' may be used unescaped. If so,
then we need to use (sgml-beginning-of-tag) (which already exists).

In that case, I think it is a good idea to optimize this: rename the current
`sgml-beginning-of-tag' to something like `sgml-get-name-of-tag' and
write a faster `sgml-beginning-of-tag'.

Based on Stefan's regular expression, I tried this:

(defun sgml-beginning-of-tag ()
  "Skip to beginning of tag."
  (re-search-backward "</?\\([^<\"']\\|'[^']*'\\|\"[^\"]*\"\\)*/?" nil t))

but this will stop at the wrong (quoted) '<' in e.g. '<a href =
"><x.png" border="><0">', probably due to the re-search-_backward_.

I will keep trying to find a solution.
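
One possible way around the backward-search problem, sketched here and not
tested against real documents: scan forward from the start of the buffer
with a regexp that matches whole tags, and remember the last tag that
begins before the position of interest. Because the quoted-string
alternatives consume attribute values whole, a '<' or '>' inside quotes is
never mistaken for a tag boundary. The name `my-sgml-tag-start-before' is
hypothetical, and scanning from `point-min' is slow on large buffers; a
real version would need a closer safe starting point.

```elisp
(defun my-sgml-tag-start-before (pos)
  "Return the start of the last complete tag beginning before POS, or nil.
Hypothetical helper: scans forward from `point-min' so that quoted
`<' and `>' characters are skipped as part of attribute values."
  (save-excursion
    (goto-char (point-min))
    (let ((re "<\\([^<>\"']\\|'[^']*'\\|\"[^\"]*\"\\)*>")
          start)
      ;; Every match is at least two characters long, so point always
      ;; advances and the loop terminates.
      (while (and (< (point) pos)
                  (re-search-forward re nil t))
        (when (< (match-beginning 0) pos)
          (setq start (match-beginning 0))))
      start)))

;; Usage with the problematic example from above:
(with-temp-buffer
  (insert "<a href = \"><x.png\" border=\"><0\">")
  (my-sgml-tag-start-before (point-max)))   ; returns 1, the real tag start
```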

> > by something like this:
> > ;; exit once unquoted '/' or '>' is found:
> > ;; in HTML (SGML?), unquoted attribute values may only contain
> > ;; [A-Za-z0-9-.]  (section 3.2.2 of the html4 spec);
> > ;; in xml all attribute values are quoted.
> > ;; TODO: maybe this can be done faster with a regular expression ?
> > ;; (something like sgml-start-tag-regexp)
> > (while (not (and (or (char-equal (char-after) ?/)
> > 		     (char-equal (char-after) ?>))
> > 		 (null quote-token)))
> > 
> >   (if (and (char-equal (char-after) ?\")
> > 	   (not (char-equal (char-before) ?\\)))
> >       ;; unescaped ?\"
> >       (cond ((null quote-token)
> > 	     ;; start of quoted content
> > 	     (setq quote-token ?\"))
> > 	    ;; end of quoted content
> > 	    ((char-equal quote-token ?\")
> > 	     (setq quote-token nil))
> > 	    ;; quote-token == ?\' => part of quoted content => ignore
> > 	    ))
> 
> Use forward-sexp to jump over matched ".  It's enormously faster.

I guess the re-search-forward is faster, right ?

> >   (if (and (char-equal (char-after) ?\')
> > 	   (not (char-equal (char-before) ?\\)))
> >       ;; unescaped ?\'
> >       (cond ((null quote-token)
> > 	     ;; start of quoted content
> > 	     (setq quote-token ?\'))
> > 	    ;; end of quoted content
> > 	    ((char-equal quote-token ?\')
> > 	     (setq quote-token nil))
> > 	    ;; quote-token == ?\" => part of quoted content => ignore
> > 	    ))
> >   (forward-char))
> 
> Indeed the above looks like
> 
>   (re-search-forward "\\([^/>\"']\\|'[^']*'\\|\"[^\"]*\"\\)*[/>]" limit t)

I suggest naming this `sgml-end-of-tag' (note that for symmetry with
sgml-beginning-of-tag, this skips up to '>' and allows whitespace
before the '/', as in <br />):

(defun sgml-end-of-tag ()
  "Skip to end of tag (either '/' or '>')."
  (re-search-forward "\\([^/>\"']\\|'[^']*'\\|\"[^\"]*\"\\)*/?" nil t))
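
To see what this regexp does with quoted '>' characters, here is a small
self-contained check. `my-sgml-end-of-tag' and `my-sgml-demo-end-of-tag'
are stand-in names so the sketch does not clash with anything in
sgml-mode.el. One thing to be aware of: every part of the regexp is
optional, so the search always succeeds (possibly without moving), and
callers have to look at the character after point to know whether a tag
end was really found.

```elisp
(defun my-sgml-end-of-tag ()
  "Move forward to the end of the tag around point.
Stops just before `>', or just after an unquoted `/'."
  (re-search-forward "\\([^/>\"']\\|'[^']*'\\|\"[^\"]*\"\\)*/?" nil t))

(defun my-sgml-demo-end-of-tag (text)
  "Insert TEXT in a scratch buffer and return the character at the tag end."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (my-sgml-end-of-tag)
    (char-after)))

;; The quoted ">" is skipped; point ends up in front of the real one.
(my-sgml-demo-end-of-tag "<a href=\"x>y\">")     ; returns ?>
;; Whitespace before the "/" is allowed; point ends up after the "/".
(my-sgml-demo-end-of-tag "<img alt=\"a>b\" />")  ; returns ?>
;; No tag end at all: the search still "succeeds", at end of buffer.
(my-sgml-demo-end-of-tag "<a href=x")            ; returns nil
```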

> > - support `sgml-empty-tags': in xml-lite-parse-tag-backward:
> > add code in this block:
> >        (t
> >         (setq tag-type 'open
> >               name (xml-lite-parse-tag-name)
> >               name-end (point))
> >         ;; check whether it's an empty tag
> >         (if (and tag-end (eq ?/ (char-before (- tag-end 1))))
> >             (setq tag-type 'empty)))
> 
> That looks easy enough, indeed.

I use this helper function, which makes the code simpler:

(defun sgml-is-empty-tag-p (tag-name)
  "Return non-nil if TAG-NAME is in `sgml-empty-tags'.
The comparison ignores case unless `sgml-xml' is non-nil."
  (if (null tag-name)
      nil
    (if sgml-xml
        (member tag-name sgml-empty-tags)
      (member-ignore-case tag-name sgml-empty-tags))))

(but you can just as well put the code "inline", because this is
the only place where sgml-is-empty-tag-p is used)

161c161,162
<         (if (and tag-end (eq ?/ (char-before (- tag-end 1))))
---
>         (if (or (and tag-end (eq ?/ (char-before (- tag-end 1))))
> 		(sgml-is-empty-tag-p name))

but this is not yet tested.

> > - support `sgml-unclosed-tags': we cannot get this perfectly right,

First we need this in sgml-mode.el:

(defvar sgml-unclosed-tags nil
  "A list of elements for which the end-tag may be omitted.
In XML these elements should be closed or empty-element tags.
This variable is most useful when used file-locally
\(see C-h i m Emacs RET m Local Variables in Files RET)")

and a helper function:

(defun sgml-is-unclosed-tag-p (tag-name)
  "Return non-nil if TAG-NAME is in `sgml-unclosed-tags'.
The comparison ignores case unless `sgml-xml' is non-nil."
  (if (null tag-name)
      nil
    (if sgml-xml
        (member tag-name sgml-unclosed-tags)
      (member-ignore-case tag-name sgml-unclosed-tags))))

(The check for null came in handy when I wrote my indenter,
so I suggest including it; we can drop it if it turns out to be
useless.)
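
Since the empty-tag and unclosed-tag helpers differ only in the list they
consult, they could share one predicate. A minimal sketch, with
hypothetical names (`my-sgml-tag-in-list-p', `my-sgml-xml', and the sample
list) standing in for the real sgml-mode.el variables:

```elisp
(defvar my-sgml-xml nil
  "Non-nil means tag names are case-sensitive (stand-in for `sgml-xml').")

(defun my-sgml-tag-in-list-p (tag-name tags)
  "Return non-nil if TAG-NAME is a member of the string list TAGS.
A nil TAG-NAME yields nil.  Case is ignored unless `my-sgml-xml' is set."
  (and tag-name
       (if my-sgml-xml
           (member tag-name tags)
         (member-ignore-case tag-name tags))))

;; Usage with a made-up list of unclosed tags:
(let ((unclosed '("li" "dt" "dd" "p")))
  (list (my-sgml-tag-in-list-p "LI" unclosed)       ; non-nil: case ignored
        (my-sgml-tag-in-list-p nil unclosed)        ; nil: no tag name
        (let ((my-sgml-xml t))
          (my-sgml-tag-in-list-p "LI" unclosed))))  ; nil: case matters
```

The two predicates would then be one-line wrappers around this function.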

> > because the dtd controls how missing end-tags are inferred. However,
> > with a set of simple rules you can make most cases work okay (my
> > "absolute" (non-relative) proof-of-concept indenter can now handle
> > unclosed <li> and <dl>'s):
> > 1. an `sgml-unclosed-tag' is implicitly closed before its parent is
> >    closed:
> > <ul>
> >    <li>
> > </ul>
> > (</li> comes directly before </ul>)
> 
> That could even be used for all tags (even those that are not
> in sgml-unclosed-tags).

I don't think we should do this, for the same reason that sgml-close-tag
(or sgml-insert-end-tag) should tell the user when an end-tag is omitted
where that is not legal.

> > 2. an `sgml-unclosed-tag' is closed before another `sgml-unclosed-tag' is
> >    opened (but this rule doesn't support e.g. <li><p>item1</p></li>..., I'll
> >    have to improve this):
> > <ul>
> >   <li>
> >   <li>
> > </ul>
> > (the first <li> is closed before the second is opened)
> > or
> > <dl>
> >  <dt>x
> >  <dd>the dependent variable
> >  <dt>y
> >  <dd>another variable
> > </dl>
> > and so on. Next I will try to support <dl>'s.
> 
> Indeed, I'm not sure what the rule should be.  For sure seeing the same
> tag again implies the previous one is closed (this covers the <li><li>
> case above).  But for mixes of unclosed tags, it's less clear.  Maybe

Yes, the second rule definitely needs to be rethought.

> we can just use some kind of precedence scheme, but I'd rather first see
> how things turn out in practice with a trivial system.

What are you referring to when you say "trivial system" ?
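
One possible reading of such a system, which may not be what Stefan has in
mind: keep a stack of open tags, apply rule 1 when an end-tag is seen, and
close an unclosed tag only when the very same tag is opened again. The
sketch below works on a pre-parsed list of tags instead of buffer text;
`my-sgml-tag-depths' and its input format are invented for illustration.

```elisp
(defun my-sgml-tag-depths (events unclosed)
  "Return the nesting depth of each tag in EVENTS.
EVENTS is a list of (TYPE . NAME) with TYPE `open' or `close'.
UNCLOSED lists the tag names whose end-tag may be omitted."
  (let (stack depths)
    (dolist (ev events)
      (let ((type (car ev))
            (name (cdr ev)))
        (if (eq type 'open)
            (progn
              ;; Same unclosed tag again: the previous one ends here.
              (when (and (member name unclosed)
                         (equal (car stack) name))
                (pop stack))
              (push (length stack) depths)
              (push name stack))
          ;; Rule 1: an end-tag implicitly closes unclosed tags above it.
          (while (and stack
                      (not (equal (car stack) name))
                      (member (car stack) unclosed))
            (pop stack))
          (when (equal (car stack) name)
            (pop stack))
          (push (length stack) depths))))
    (nreverse depths)))

;; <ul> <li> <li> </ul>
(my-sgml-tag-depths
 '((open . "ul") (open . "li") (open . "li") (close . "ul"))
 '("li"))                                   ; returns (0 1 1 0)
```

With the <dl> example this sketch nests every <dd> under the preceding
<dt>, and every later <dt> deeper still, which is where a precedence
scheme for mixed unclosed tags would have to come in.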

> > But xml-lite has some xml-only features which are nice and is
> > very fast, so I think we might want to keep xml-lite as is and we use
> > sgml-close-tag and a new (slower) relative indenter for sgml-mode.
> 
> I don't think that the changes to make it handle sgml indentation should
> make it noticeably slower.

Okay, something general about xml-lite.el/sgml-mode.el: if we support
jsp/php/asp etc., then the code will be more complicated.

That's why I suggest telling users who need this to use html-helper-mode
(http://www.gest.unipd.it/~saint/hth.html) instead.

> 	Stefan
> 
> PS: any reason why you took out the rest of the crowd from the Cc line ?
>     I'm pretty sure Mike Williams would like to get copies of our discussion.

done :-)

-- 
Felix Natter

_______________________________________________
Emacs-devel mailing list
Emacs-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/emacs-devel